Closed Bug 1155871 Opened 10 years ago Closed 10 years ago

Using v4 data, create csv rollup that matches the one used by executive dashboard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kparlante, Assigned: trink)

References

Details

Here is the live data:
https://metrics.services.mozilla.com/firefox-weekly-dashboard/data/firefox_monthly_data.csv
https://metrics.services.mozilla.com/firefox-weekly-dashboard/data/firefox_weekly_data.csv

The pipeline to get there:
- FHR v2 weekly deorphaned data, 1% sample, on HDFS at hdfs:///user/sguha/fhr/samples/output/1pct/ (the source of the 1% file can be found in hdfs://user/sguha/fhr/samples/output/createdTime.txt)
- Loads the Vertica tables: https://github.com/mozilla/fhr-r-rollups
- Generates the csv files from the Vertica tables: https://github.com/mozilla/exec-dashboard-data-pulling-scripts/blob/master/pull_firefox_data.py

Vertica table schema in this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1136012
Dashboard source: https://github.com/mozilla/firefox-weekly-dashboard/
Dashboard itself: https://metrics.mozilla.com/protected/dashboards/firefox/
Status: NEW → ASSIGNED
Priority: -- → P1
Bug 1121013 - blocks crashes
Bug 1159245 - blocks actives and inactives

Created a decoder that produces output suitable for the executive reports: https://github.com/mozilla-services/data-pipeline/blob/executive_summary/heka/sandbox/decoders/extract_executive_summary.lua
Depends on: 1121013, 1159245
It is unclear how to derive the default search counts from the v4 data; please specify. I am seeing data like this:

"SEARCH_COUNTS":{"google.urlbar":{"range":[1,2],"bucket_count":3,"histogram_type":4,"values":{"0":1,"1":0},"sum":1,"sum_squares_lo":1,"sum_squares_hi":0}},
"SEARCH_DEFAULT_ENGINE":{"google":{"range":[1,2],"bucket_count":3,"histogram_type":3,"values":{"0":0,"1":1,"2":0},"sum":1,"sum_squares_lo":1,"sum_squares_hi":0}}

The search count for 'google' is 1 (SEARCH_COUNTS["google.urlbar"]["sum"]). Since the default search engine is set to google, would you want the 'default' column to also be incremented by 1?

And:

"SEARCH_COUNTS":{},
"SEARCH_DEFAULT_ENGINE":{"other-WebWebWeb - by Video Downloader Professional":{"range":[1,2],"bucket_count":3,"histogram_type":3,"values":{"0":0,"1":1,"2":0},"sum":1,"sum_squares_lo":1,"sum_squares_hi":0}}

Here the search counts for 'other' and 'default' would be 0, since no searches were performed.
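For what it's worth, here is a minimal Python sketch of how I am currently reading the search counts out of those records (summing each histogram's 'sum', grouped by the engine prefix of the key). This only illustrates my interpretation, not agreed behaviour, and the handling of the 'default' column is exactly the open question above, so it is not shown:

import json
from collections import Counter

def rollup_search_counts(payload):
    # Sum the 'sum' field of every SEARCH_COUNTS histogram, grouped by the
    # engine name (the key prefix before the first '.'); the field names come
    # from the sample records quoted above, not from a formal spec.
    counts = Counter()
    for key, hist in payload.get("SEARCH_COUNTS", {}).items():
        engine = key.split(".", 1)[0]      # e.g. "google.urlbar" -> "google"
        counts[engine] += hist.get("sum", 0)
    return counts

sample = json.loads('{"SEARCH_COUNTS":{"google.urlbar":{"sum":1}}}')
print(rollup_search_counts(sample))        # Counter({'google': 1})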
Flags: needinfo?(sguha)
What is the specification of the "default" column? My naive understanding is that it tracks whether Firefox is the default browser, and has nothing to do with searches. Bug 1136012 is definitely not a specification.
Correction: 1159245 - blocks five_of_seven and inactives. However, there is no way we will be able to calculate the ~30K partitions of these metrics in a single pass (the cuckoo filters would be too big). Although many partitions will be very small, there is no way a cuckoo filter can grow to accommodate new values. We will need to select the most critical partitions (and either distribute the work or run multiple passes over the data). Also, these calculations are based on the entire data set, not a 1% sample; inactive and five_of_seven should be at least 99.99% accurate, active is fixed at ~99%, and everything else is an exact count. I propose just starting with five_of_seven and inactives for the 'All' partition. Is this acceptable?
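For context, here is a back-of-envelope estimate of why that does not fit; the per-item size and the worst-case partition capacity below are assumptions for illustration, not measurements from the pipeline:

BYTES_PER_ITEM = 2           # typical cuckoo-filter fingerprint size (assumed)
PARTITIONS = 30000           # the ~30K partitions mentioned above
CAPACITY = 5000000           # clients a single large partition must hold (assumed)

# A cuckoo filter cannot grow, so every filter has to be provisioned for its
# worst case up front; sizing them all that way blows up the memory budget:
total_bytes = PARTITIONS * CAPACITY * BYTES_PER_ITEM
print("%.0f GiB" % (total_bytes / 2.0**30))   # ~279 GiB under these assumptions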
Flags: needinfo?(kparlante)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #3)
> What is the specification of the "default" column? My naive understanding is
> that it tracks whether Firefox is the default browser, and has nothing to do
> with searches. Bug 1136012 is definitely not a specification.

Thanks for the pointer; it looks like it relates to: https://github.com/mozilla/fhr-r-rollups/blob/7589d13964d05fb1578fb5f495248998e2e85e52/makeFlatTables.v3.R#L154
Flags: needinfo?(sguha)
The code referenced in comment 5 infers the profile default status (not search default) during the timespan:
1. If the profile was active, the profile is default=yes if it was default=yes in more than 50% of its sessions.
2. If the profile was not active during the time period, use the last value carried forward, assuming it has some history.
3. If the profile was not active during the time period and has no previous history, default = no.
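Restated as code for clarity (a Python sketch of the R logic linked in comment 5; the function and argument names are made up for illustration, and the per-session flags are assumed to be booleans):

def profile_default_status(session_flags, previous_status):
    # session_flags: default-browser flag for each session in the timespan.
    # previous_status: last known value from earlier history, or None.
    if session_flags:                                 # 1. active: strict majority vote
        return sum(session_flags) > len(session_flags) / 2.0
    if previous_status is not None:                   # 2. inactive: last value carried forward
        return previous_status
    return False                                      # 3. inactive with no history: default = no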
(In reply to "Saptarshi Guha[:joy]" from comment #6) I think we need to redefine this since there can be multiple session pings per day. Just counting the total number over the timespan and seeing if it is more than half probably won't cut it. Does the last ping of the day determine its value for the entire day? What this really comes down to is how do we stitch the pings back together. Day = Flag in sub session ping Mon = T, T, F Tue = F Wed = F, T, T, T, T 6/9 pings were true (but the distribution isn't even, so the totals don't mean much) 1/3 of the days were true (if we count Mon as false) The stitching if further complicated if the user switched buckets during the time frame (say the country changed).
Could you elaborate on why a more complicated scheme is required? The idea is to find out what their status is; we are not trying to pick up micro-changes (people don't switch default status multiple times a week). FHR v2 was recorded once per day and our time granularity is at the weekly level, so "more than half" works for us. If people switched for a day and then went back to their previous status, we were not really interested in that wrt a weekly granularity. If we now have session-level granularity, then I would suggest 6/9 pings. But "more than half" is just a rule that determines which bucket (default = 1/0) one falls into. There could be other rules (e.g. more than 80%, or the value with the longest run length). Maybe compute a contingency table of the new rule wrt the existing rule to see what the classification difference is.
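One way to run that comparison (illustrative Python only; the toy profiles and the two rule definitions are made up just to show the shape of the contingency table):

from collections import Counter

profiles = {
    "a": {"Mon": [True, True, False], "Wed": [False, True, True, True, True]},
    "b": {"Tue": [False]},
}

def majority_of_pings(days):
    pings = [p for d in days.values() for p in d]
    return sum(pings) > len(pings) / 2.0

def majority_of_days(days):
    day_vals = [sum(d) > len(d) / 2.0 for d in days.values()]
    return sum(day_vals) > len(day_vals) / 2.0

# Cross-tabulate the existing-style rule against the candidate rule.
table = Counter((majority_of_days(d), majority_of_pings(d)) for d in profiles.values())
print(table)    # Counter({(True, True): 1, (False, False): 1}) for this toy data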
Personally I don't see the usefulness of a 'mostly default' calculation (but it is not my data). The data pipeline team is here to help facilitate the analysis of the data by getting it to the people that need it and supporting the underlying tooling. In the past we have helped teams bootstrap their analysis, I am assuming that is why this bug landed here. However, without a specification on how you would like us to deal with your new data format and how it should be analysed this task will progress very slowly.
(In reply to Mike Trinkala [:trink] from comment #9)
> Personally I don't see the usefulness of a 'mostly default' calculation (but
> it is not my data). The data pipeline team is here to help facilitate the
> analysis of the data by getting it to the people that need it and supporting
> the underlying tooling. In the past we have helped teams bootstrap their
> analysis, I am assuming that is why this bug landed here. However, without
> a specification on how you would like us to deal with your new data format
> and how it should be analysed this task will progress very slowly.

I created this bug, and am responsible for it landing here. The rollup table schema that is needed for the executive dashboard has already been defined for v2 (if not as well documented as it could be). How to generate that table from v4 data is not yet well defined and is part of the work ahead of us (where "us" is the extended team) for this project. My intent in creating the bug and assigning it to you in the absence of that definition is to have a concrete working example of the approach we're proposing and to start flushing out the problem areas. Sounds like the next step should be to work on that specification; that would be more efficient than asking questions field by field in this bug. cc'ing spenrose, who may help us out here.
Flags: needinfo?(kparlante)
(In reply to Mike Trinkala [:trink] from comment #4)
> Correction: 1159245 - blocks five_of_seven and inactives.
>
> However, there is no way we will be able to calculate the ~30K partitions of
> these metrics in a single pass (the cuckoo filters would be too big).
> Although many partitions will be very small, there is no way a cuckoo filter
> can grow to accommodate new values. We will need to select the most
> critical partitions (and either distribute the work or run multiple passes
> over the data). Also, these calculations are based on the entire data set,
> not a 1% sample; inactive and five_of_seven should be at least 99.99%
> accurate, active is fixed at ~99%, and everything else is an exact count.
> I propose just starting with five_of_seven and inactives for the 'All'
> partition. Is this acceptable?

Yes, that sounds like a reasonable next step.
I reworked the cuckoo filter and analysis. It is now able to handle up to 256 countries, 4 operating systems, and 8 channels, along with the new/inactive/total/default/five_of_seven metrics. Running a one-year report (weekly rollups) with 73MM users took 40 minutes on my x230 laptop. As the number of users increases, the weekly summation takes longer (it was up to 28 seconds by the end, adding about 0.5 seconds for every 1.4MM users). The peak working memory was 1.6GiB. I am still hoping to generalize this module, but I am doubtful the performance will be acceptable.
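As a rough sanity check on those numbers (Python, assuming the scaling stays linear):

partitions = 256 * 4 * 8           # country x os x channel = 8192 cells
users = 73e6
seconds_per_user = 0.5 / 1.4e6     # reported growth: ~0.5 s per 1.4MM additional users
print(partitions, users * seconds_per_user)   # 8192 partitions, ~26 s -- close to the observed 28 s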
No longer depends on: 1159245
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
r+
Flags: needinfo?(mreid)
See Also: → 1174912
Product: Cloud Services → Cloud Services Graveyard