Closed Bug 1155871 Opened 10 years ago Closed 10 years ago

Using v4 data, create csv rollup that matches the one used by executive dashboard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kparlante, Assigned: trink)

References

Details

Here is the live data:
https://metrics.services.mozilla.com/firefox-weekly-dashboard/data/firefox_monthly_data.csv
https://metrics.services.mozilla.com/firefox-weekly-dashboard/data/firefox_weekly_data.csv

The pipeline to get there:
- FHR v2 weekly deorphaned data, 1% sample, on HDFS at hdfs:///user/sguha/fhr/samples/output/1pct/ (the source of the 1% file can be found in hdfs://user/sguha/fhr/samples/output/createdTime.txt)
- Loads the Vertica tables: https://github.com/mozilla/fhr-r-rollups
- Generates the csv files from the Vertica tables: https://github.com/mozilla/exec-dashboard-data-pulling-scripts/blob/master/pull_firefox_data.py

Vertica table schema in this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1136012
Dashboard source: https://github.com/mozilla/firefox-weekly-dashboard/
Dashboard itself: https://metrics.mozilla.com/protected/dashboards/firefox/
Status: NEW → ASSIGNED
Priority: -- → P1
Bug 1121013 - blocks crashes
Bug 1159245 - blocks actives and inactives

Created a decoder that produces output suitable for the executive reports: https://github.com/mozilla-services/data-pipeline/blob/executive_summary/heka/sandbox/decoders/extract_executive_summary.lua
Depends on: 1121013, 1159245
It is unclear how to derive the default search counts from the v4 data; please specify. I am seeing data like this:

"SEARCH_COUNTS":{"google.urlbar":{"range":[1,2],"bucket_count":3,"histogram_type":4,"values":{"0":1,"1":0},"sum":1,"sum_squares_lo":1,"sum_squares_hi":0}},
"SEARCH_DEFAULT_ENGINE":{"google":{"range":[1,2],"bucket_count":3,"histogram_type":3,"values":{"0":0,"1":1,"2":0},"sum":1,"sum_squares_lo":1,"sum_squares_hi":0}}

The search count for 'google' is 1 (SEARCH_COUNTS["google.urlbar"]["sum"]). Since the default search engine is set to google, would you want the 'default' column to also be incremented by 1?

And:

"SEARCH_COUNTS":{},
"SEARCH_DEFAULT_ENGINE":{"other-WebWebWeb - by Video Downloader Professional":{"range":[1,2],"bucket_count":3,"histogram_type":3,"values":{"0":0,"1":1,"2":0},"sum":1,"sum_squares_lo":1,"sum_squares_hi":0}}

Here the search counts for 'other' and 'default' would be 0, since no searches were performed.
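For what it's worth, here is a minimal Python sketch of how I am currently reading the search counts out of those records (summing each histogram's 'sum', grouped by the engine prefix of the key). This only illustrates my interpretation, not agreed behaviour, and the handling of the 'default' column is exactly the open question above, so it is not shown:

import json
from collections import Counter

def rollup_search_counts(payload):
    # Sum the 'sum' field of every SEARCH_COUNTS histogram, grouped by the
    # engine name (the key prefix before the first '.'); the field names come
    # from the sample records quoted above, not from a formal spec.
    counts = Counter()
    for key, hist in payload.get("SEARCH_COUNTS", {}).items():
        engine = key.split(".", 1)[0]      # e.g. "google.urlbar" -> "google"
        counts[engine] += hist.get("sum", 0)
    return counts

sample = json.loads('{"SEARCH_COUNTS":{"google.urlbar":{"sum":1}}}')
print(rollup_search_counts(sample))        # Counter({'google': 1})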
Flags: needinfo?(sguha)
What is the specification of the "default" column? My naive understanding is that it tracks whether Firefox is the default browser, and has nothing to do with searches. Bug 1136012 is definitely not a specification.
Correction: 1159245 - blocks five_of_seven and inactives. However, there is no way we will be able to calculate the ~30K partitions of these metrics in a single pass (the cuckoo filters would be too big). Although many partitions will be very small, there is no way a cuckoo filter can grow to accommodate new values. We will need to select the most critical partitions (and either distribute the work or run multiple passes over the data). Also, these calculations are based on the entire data set, not a 1% sample; inactive and five_of_seven should be at least 99.99% accurate, active is fixed at ~99%, and everything else is an exact count. I propose just starting with five_of_seven and inactives for the 'All' partition. Is this acceptable?
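For context, here is a back-of-envelope estimate of why that does not fit; the per-item size and the worst-case partition capacity below are assumptions for illustration, not measurements from the pipeline:

BYTES_PER_ITEM = 2           # typical cuckoo-filter fingerprint size (assumed)
PARTITIONS = 30000           # the ~30K partitions mentioned above
CAPACITY = 5000000           # clients a single large partition must hold (assumed)

# A cuckoo filter cannot grow, so every filter has to be provisioned for its
# worst case up front; sizing them all that way blows up the memory budget:
total_bytes = PARTITIONS * CAPACITY * BYTES_PER_ITEM
print("%.0f GiB" % (total_bytes / 2.0**30))   # ~279 GiB under these assumptions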
Flags: needinfo?(kparlante)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #3)
> What is the specification of the "default" column? My naive understanding is
> that it tracks whether Firefox is the default browser, and has nothing to do
> with searches. Bug 1136012 is definitely not a specification.

Thanks for the pointer; it looks like it relates to: https://github.com/mozilla/fhr-r-rollups/blob/7589d13964d05fb1578fb5f495248998e2e85e52/makeFlatTables.v3.R#L154
Flags: needinfo?(sguha)
The code referenced in comment 5 infers the profile default status (not search default) during the timespan:
1. If the profile was active, the profile is default=yes if it was default=yes in more than 50% of its sessions.
2. If the profile was not active during the time period, use the last value carried forward, assuming it has some history.
3. If the profile was not active during the time period and has no previous history, default = no.
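Restated as code for clarity (a Python sketch of the R logic linked in comment 5; the function and argument names are made up for illustration, and the per-session flags are assumed to be booleans):

def profile_default_status(session_flags, previous_status):
    # session_flags: default-browser flag for each session in the timespan.
    # previous_status: last known value from earlier history, or None.
    if session_flags:                                 # 1. active: strict majority vote
        return sum(session_flags) > len(session_flags) / 2.0
    if previous_status is not None:                   # 2. inactive: last value carried forward
        return previous_status
    return False                                      # 3. inactive with no history: default = no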
(In reply to "Saptarshi Guha[:joy]" from comment #6) I think we need to redefine this since there can be multiple session pings per day. Just counting the total number over the timespan and seeing if it is more than half probably won't cut it. Does the last ping of the day determine its value for the entire day? What this really comes down to is how do we stitch the pings back together. Day = Flag in sub session ping Mon = T, T, F Tue = F Wed = F, T, T, T, T 6/9 pings were true (but the distribution isn't even, so the totals don't mean much) 1/3 of the days were true (if we count Mon as false) The stitching if further complicated if the user switched buckets during the time frame (say the country changed).
Could you elaborate on why a more complicated scheme is required? The idea is to find out what their status is; we are not trying to pick up micro-changes (people don't switch default status multiple times a week). FHR v2 was recorded once per day and our time granularity is at the weekly level, so "more than half" works for us. If people switched for a day and then went back to their previous status, we were not really interested in that wrt a weekly granularity. If we now have session-level granularity, then I would suggest 6/9 pings. But "more than half" is just a rule that determines which bucket (default = 1/0) one falls into. There could be other rules (e.g. more than 80%, or the value with the longest run length). Maybe compute a contingency table of the new rule wrt the existing rule to see what the classification difference is.
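One way to run that comparison (illustrative Python only; the toy profiles and the two rule definitions are made up just to show the shape of the contingency table):

from collections import Counter

profiles = {
    "a": {"Mon": [True, True, False], "Wed": [False, True, True, True, True]},
    "b": {"Tue": [False]},
}

def majority_of_pings(days):
    pings = [p for d in days.values() for p in d]
    return sum(pings) > len(pings) / 2.0

def majority_of_days(days):
    day_vals = [sum(d) > len(d) / 2.0 for d in days.values()]
    return sum(day_vals) > len(day_vals) / 2.0

# Cross-tabulate the existing-style rule against the candidate rule.
table = Counter((majority_of_days(d), majority_of_pings(d)) for d in profiles.values())
print(table)    # Counter({(True, True): 1, (False, False): 1}) for this toy data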
Personally I don't see the usefulness of a 'mostly default' calculation (but it is not my data). The data pipeline team is here to help facilitate the analysis of the data by getting it to the people that need it and supporting the underlying tooling. In the past we have helped teams bootstrap their analysis, I am assuming that is why this bug landed here. However, without a specification on how you would like us to deal with your new data format and how it should be analysed this task will progress very slowly.
(In reply to Mike Trinkala [:trink] from comment #9)
> Personally I don't see the usefulness of a 'mostly default' calculation (but
> it is not my data). The data pipeline team is here to help facilitate the
> analysis of the data by getting it to the people that need it and supporting
> the underlying tooling. In the past we have helped teams bootstrap their
> analysis, I am assuming that is why this bug landed here. However, without
> a specification on how you would like us to deal with your new data format
> and how it should be analysed this task will progress very slowly.

I created this bug, and am responsible for it landing here. The rollup table schema that is needed for the executive dashboard has already been defined for v2 (if not as well documented as it could be). How to generate that table from v4 data is not yet well defined and is part of the work ahead of us (where "us" is the extended team) for this project. My intent in creating the bug and assigning it to you in the absence of that definition is to have a concrete working example of the approach we're proposing and to start flushing out the problem areas. Sounds like the next step should be to work on that specification; that would be more efficient than asking questions field by field in this bug. cc'ing spenrose, who may help us out here.
Flags: needinfo?(kparlante)
(In reply to Mike Trinkala [:trink] from comment #4)
> Correction: 1159245 - blocks five_of_seven and inactives.
>
> However, there is no way we will be able to calculate the ~30K partitions of
> these metrics in a single pass (the cuckoo filters would be too big).
> Although many partitions will be very small, there is no way a cuckoo filter
> can grow to accommodate new values. We will need to select the most
> critical partitions (and either distribute the work or run multiple passes
> over the data). Also, these calculations are based on the entire data set,
> not a 1% sample; inactive and five_of_seven should be at least 99.99%
> accurate, active is fixed at ~99%, and everything else is an exact count.
> I propose just starting with five_of_seven and inactives for the 'All'
> partition. Is this acceptable?

Yes, that sounds like a reasonable next step.
I reworked the cuckoo filter and analysis. It is now able to handle up to 256 countries, 4 operating systems, and 8 channels, along with the new/inactive/total/default/five_of_seven metrics. Running a one-year report (weekly rollups) with 73MM users took 40 minutes on my x230 laptop. As the number of users increases, the weekly summation takes longer (it was up to 28 seconds by the end, adding about 0.5 seconds for every 1.4MM users). The peak working memory was 1.6GiB. I am still hoping to generalize this module, but I am doubtful the performance will be acceptable.
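As a rough sanity check on those numbers (Python, assuming the scaling stays linear):

partitions = 256 * 4 * 8           # country x os x channel = 8192 cells
users = 73e6
seconds_per_user = 0.5 / 1.4e6     # reported growth: ~0.5 s per 1.4MM additional users
print(partitions, users * seconds_per_user)   # 8192 partitions, ~26 s -- close to the observed 28 s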
No longer depends on: 1159245
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
r+
Flags: needinfo?(mreid)
See Also: → 1174912
Product: Cloud Services → Cloud Services Graveyard