Closed Bug 1450329 Opened 7 years ago Closed 7 years ago

Investigate adding client_count HLL to search_aggregates

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: harter, Assigned: wlach)

References

Details

Attachments

(2 files)

YOY comparison.xlsx 7 years ago arana 9.99 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet		Details
Link to GitHub pull-request: https://github.com/mozilla/python_mozetl/pull/231 7 years ago GitHub Bugzilla PR Linker 49 bytes, text/x-github-pull-request		Details \| Review

Ryan Harter [:harter]

Reporter

Description

•

7 years ago

BD is interested in client-adjusted metrics (like searches per user). We can calculate these types of metrics in search_clients_daily, but those queries take a long time to run due to the size of that dataset. Instead, we could add a client_count HLL column to search_aggregates. This would make client-adjusted metrics easy to calculate.

Ryan Harter [:harter]

Reporter

Comment 1

•

7 years ago

Will - do you want to take this bug?

Flags: needinfo?(wlachance)

William Lachance (:wlach)

Assignee

Comment 2

•

7 years ago

(In reply to Ryan Harter [:harter] from comment #1)
> Will - do you want to take this bug?

I'm interested, when does it need to be done? ASAP?

Flags: needinfo?(wlachance) → needinfo?(rharter)

Ryan Harter [:harter]

Reporter

Comment 3

•

7 years ago

Q2 would be good, early May would be great. search_clients_daily is sufficing for now, but I expect the HLL would make some upcoming dashboard work much easier.

Flags: needinfo?(rharter)

William Lachance (:wlach)

Assignee

Comment 4

•

7 years ago

Early May should be totally doable. I'm trying to wrap up some mission control work before I leave for a week in mid-April; if I can get to this before then, I will.

Assignee: nobody → wlachance

arana

Comment 5

•

7 years ago

Hi Ryan, Will, thanks for taking this up. Is it possible to add a default_search_engine (not default_search_engine_data_name) column as well, in addition to the HLL?

Ryan Harter [:harter]

Reporter

Comment 6

•

7 years ago

Do you mind filing a separate bug for that work, arana? Thanks!

arana

Comment 7

•

7 years ago

Added this to Bug 1451907. Hope this is good.

William Lachance (:wlach)

Assignee

Comment 8

•

7 years ago

Sorry for the delay on this, I've started work on this. Seems pretty straightforward with the use of approx_count_distinct.

arana

Comment 9

•

7 years ago

Hi William, approx_count_distinct does not work very well when it comes to performing a year-over-year (YOY) analysis. The approximate client count shows an acceptable error relative to the actual count, but when you do a YOY analysis on this data, the small error throws off the YOY percentage increase/decrease, since the percentage changes are small (single digit). Please have a look at the attached sheet (YOY comparison.xlsx). As you can see, the Feb 2018 YOY drop has changed significantly. Kindly advise on this.
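To illustrate the concern with made-up numbers (these are not taken from the attached sheet): the sketch below assumes a true 3% year-over-year drop and a hypothetical 2% relative error on each approximate count, then computes the worst-case range of the observed YOY change. The function names and all figures are illustrative assumptions.

```python
from typing import Tuple


def yoy_change(current: float, previous: float) -> float:
    """Fractional year-over-year change, e.g. -0.03 for a 3% drop."""
    return current / previous - 1.0


def yoy_range(current: float, previous: float, rel_error: float) -> Tuple[float, float]:
    """Worst-case YOY range when each count may be off by +/- rel_error."""
    # Lowest observed change: current undercounted, previous overcounted.
    low = yoy_change(current * (1 - rel_error), previous * (1 + rel_error))
    # Highest observed change: current overcounted, previous undercounted.
    high = yoy_change(current * (1 + rel_error), previous * (1 - rel_error))
    return low, high


# Hypothetical client counts: a true 3% year-over-year drop.
true_change = yoy_change(97_000, 100_000)
low, high = yoy_range(97_000, 100_000, rel_error=0.02)
print(f"true: {true_change:+.1%}, observed range: {low:+.1%} to {high:+.1%}")
```

Because the two errors can compound, a per-count error that looks small on its own can be as large as the single-digit change being measured, and can even flip its sign.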

Flags: needinfo?(wlachance)

arana

Comment 10

•

7 years ago

Attached file YOY comparison.xlsx — Details

William Lachance (:wlach)

Assignee

Comment 11

•

7 years ago

(In reply to arana from comment #9)
> Hi William,
>
> approx_count_distinct does not work very well when it comes to performing a
> year-over-year (YOY) analysis. The approximate client count shows an
> acceptable error relative to the actual count, but when you do a YOY analysis
> on this data, the small error throws off the YOY percentage
> increase/decrease, since the percentage changes are small (single digit).
> Please have a look at the attached sheet (YOY comparison.xlsx). As you can
> see, the Feb 2018 YOY drop has changed significantly. Kindly advise on this.

I don't know if there's any reasonable alternative to using approx_count_distinct/hll for this task. It's known to be imprecise but getting an exact number is impossible given the memory requirements and the size of our dataset.

Ryan, do you have any thoughts on this or alternative approaches? Is this even worth doing (I'm guessing a known imprecise number is still better than no number)?

Flags: needinfo?(wlachance) → needinfo?(rharter)

Frank Bertsch [:frank]

Comment 12

•

7 years ago

A few comments:

1. This can't be done until bug 1305087 is complete, since this is Python code.
2. The precision of HLLs is configurable by setting the number of bits. Currently we set it at 12 (this gives 2.3% error), we can increase it for this dataset - but we would need to test that the Presto HLL reader is configurable as well.
3. Since this is just another client_count dataset at the end of the day, if we fix bug 1405066 we could consider running this from the existing GenericCountView [0], and then not have to rely on bug 1305087.

[0] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/GenericCountView.scala
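For reference on how precision scales with the number of bits: the textbook HyperLogLog approximation puts the relative standard error at about 1.04/sqrt(m) for m = 2^bits registers. The sketch below tabulates that formula. Whether the HLL library used here follows it exactly is an assumption, so implementation-specific figures may differ from what it prints.

```python
import math


def hll_standard_error(bits: int) -> float:
    """Textbook relative standard error for an HLL with 2**bits registers."""
    registers = 2 ** bits
    return 1.04 / math.sqrt(registers)


# Each extra pair of bits quadruples the registers and halves the error.
for bits in (11, 12, 14, 16):
    print(f"{bits} bits: {hll_standard_error(bits):.2%}")
```

The trade-off is storage: every additional bit doubles the number of registers kept per HLL value.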

William Lachance (:wlach)

Assignee

Comment 13

•

7 years ago

Frank, are you sure you're thinking of approx_count_distinct? The bug you're referencing is referring to some separate package called spark-hyperloglog, while approx_count_distinct is a standard part of pyspark: http://spark.apache.org/docs/latest/api/python/_modules/pyspark/rdd.html#RDD.countApproxDistinct

Frank Bertsch [:frank]

Comment 14

•

7 years ago

(In reply to William Lachance (:wlach) (use needinfo!) from comment #13)
> Frank, are you sure you're thinking of approx_count_distinct? The bug you're
> referencing is referring to some separate package called spark-hyperloglog,
> while approx_count_distinct is a standard part of pyspark:
>
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/rdd.html#RDD.countApproxDistinct

So maybe approxDistinct will support your use-case, but the problem is it only gives one number for each dimension. When you try to combine dimensions, you lose information on how many clients there are, since a single client can show up in multiple dimensions.

If there is no need for aggregation in this dataset, then you can use the single numbers. Otherwise they aren't sufficient, and that's why we don't use them for client_count datasets.
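A small sketch of the problem, using exact Python sets and hypothetical client IDs and engine names (none of this is real data): once each dimension value has been reduced to a single count, summing those counts double-counts any client that appears under more than one value.

```python
# Hypothetical data: client IDs seen per search engine on one day.
clients_by_engine = {
    "google": {"a", "b", "c"},
    "bing": {"b", "c", "d"},
}

# Pre-computed distinct counts: one number per dimension value.
counts = {engine: len(ids) for engine, ids in clients_by_engine.items()}

# Summing the per-engine counts double-counts clients "b" and "c".
summed = sum(counts.values())

# The correct total needs the union of the underlying sets, which a
# plain count has already thrown away.
actual = len(set().union(*clients_by_engine.values()))

print(summed, actual)
```

The same holds whether the per-dimension number is exact or approximate; the information needed to deduplicate across dimensions is gone either way.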

Ryan Harter [:harter]

Reporter

Comment 15

•

7 years ago

Frank appears to have covered most of this, but here's my two cents for the record.

(In reply to William Lachance (:wlach) (use needinfo!) from comment #11)
> Ryan, do you have any thoughts on this or alternative
> approaches? Is this even worth doing (I'm guessing a known imprecise number
> is still better than no number)?

Amit's correct in that HLL does have some disadvantages for trend lines and YoY analyses [1]. An exact count would be nice, but, as Frank notes, exact counts can't be aggregated. I don't think we'll be able to get these error bounds much tighter than a ~1.5% standard error. We won't be able to use this for YoY analysis, but the HLL will still be useful.

(In reply to Frank Bertsch [:frank] from comment #14)
> So maybe approxDistinct will support your use-case, but the problem is it
> only gives one number for each dimension. When you try to combine
> dimensions, you lose information on how many clients there are, since a
> single client can show up in multiple dimensions.
>
> If there is no need for aggregation in this dataset, then you can use the
> single numbers. Otherwise they aren't sufficient, and that's why we don't
> use them for client_count datasets.

We will definitely need aggregation. An HLL is the right tool for this job.

(In reply to Frank Bertsch [:frank] from comment #12)
> A few comments:
>
> 1. This can't be done until bug 1305087 is complete, since this is Python
> code.

Thanks for identifying Bug 1305087, setting that as a blocker.

> 3. Since this is just another client_count dataset at the end of the day, if
> we fix bug 1405066 we could consider running this from the existing
> GenericCountView [0], and then not have to rely on bug 1305087.

I'm fine with this, but porting the search_aggregates dataset to scala seems like more of a burden than figuring out how to register the scala function as a UDF.

[1] https://blog.harterrt.com/hll_trends.html
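To make the aggregation point concrete, here is a minimal pure-Python HyperLogLog sketch. It is an illustration only: `TinyHLL`, the SHA-1 hashing, and the client IDs are assumptions for this example and are not the spark-hyperloglog implementation discussed in this bug. It shows the property that matters here: two sketches merge by taking the register-wise maximum, so stored per-dimension HLLs can be combined later without double-counting.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative HyperLogLog; not the spark-hyperloglog implementation."""

    def __init__(self, bits: int = 12):
        self.bits = bits
        self.m = 1 << bits  # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: the top `bits` bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        width = 64 - self.bits
        index = h >> width
        rest = h & ((1 << width) - 1)
        # Rank is the position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "TinyHLL") -> "TinyHLL":
        # The union of two sketches is the register-wise maximum.
        merged = TinyHLL(self.bits)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate


# Hypothetical client IDs; 5,000 of them appear under both engines.
google = TinyHLL()
bing = TinyHLL()
for i in range(20_000):
    google.add(f"client-{i}")
for i in range(15_000, 40_000):
    bing.add(f"client-{i}")

# Approximates the 40,000 distinct clients across both engines.
combined = google.merge(bing).count()
print(round(google.count()), round(bing.count()), round(combined))
```

Merging yields exactly the registers that a single sketch fed every client would hold, which is why an HLL column can be rolled up across dimensions while a stored count cannot.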

Depends on: 1305087

Flags: needinfo?(rharter)

William Lachance (:wlach)

Assignee

Comment 16

•

7 years ago

Apologies for the delay on this; I've been rather swamped with mission control work. Now that amiyaguchi has fixed bug 1305087, this should be unblocked. I will try to get this done early next week.

GitHub Bugzilla PR Linker

Comment 17

•

7 years ago

Attached file Link to GitHub pull-request: https://github.com/mozilla/python_mozetl/pull/231 — Details

William Lachance (:wlach)

Assignee

Comment 18

•

7 years ago

I have a tentative solution for this, but am blocked on testing due to bug 1466936.

Depends on: 1466936

William Lachance (:wlach)

Assignee

Comment 19

•

7 years ago

This is now reviewed and ready to land along with some other changes here: https://github.com/mozilla/python_mozetl/pull/246 Thanks to Marina Samuel (:emtwo) for help testing this change.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

5 years ago

Component: Datasets: Search → Datasets: General

Nobody; OK to take it and work on it

Updated

•

3 years ago

Component: Datasets: General → General
