Closed
Bug 1450329
Opened 7 years ago
Closed 7 years ago
Investigate adding client_count HLL to search_aggregates
Categories
(Data Platform and Tools :: General, enhancement, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: harter, Assigned: wlach)
References
Details
Attachments
(2 files)
BD is interested in client-adjusted metrics (like searches per user). We can calculate these types of metrics in search_clients_daily, but those queries take a long time to run because of the size of that dataset.
Instead, we could add a client_count HLL column to search_aggregates. This would make client-adjusted metrics easy to calculate.
Assignee
Comment 2•7 years ago
(In reply to Ryan Harter [:harter] from comment #1)
> Will - do you want to take this bug?
I'm interested. When does it need to be done? ASAP?
Flags: needinfo?(wlachance) → needinfo?(rharter)
Reporter
Comment 3•7 years ago
Q2 would be good, early May would be great. search_clients_daily is sufficing for now, but I expect the HLL would make some upcoming dashboard work much easier.
Flags: needinfo?(rharter)
Assignee
Comment 4•7 years ago
Early May should be totally doable. I'm trying to wrap up some mission control work before I leave for a week in mid-April; if I can get to this before then, I will.
Assignee: nobody → wlachance
Hi Ryan, Will,
Thanks for taking this up.
Is it possible to add a default_search_engine (not default_search_engine_data_name) column as well, in addition to the HLL?
Reporter
Comment 6•7 years ago
Do you mind filing a separate bug for that work, arana? Thanks!
Added this to Bug 1451907. Hope this is good.
Assignee
Comment 8•7 years ago
Sorry for the delay; I've started work on this. It seems pretty straightforward with the use of approx_count_distinct.
Hi William,
approx_count_distinct does not work very well for year-over-year (YoY) analysis. The approximate client count is within an acceptable error of the actual count, but in a YoY analysis that small error throws off the YoY percentage increase/decrease, because the percentage changes themselves are small (single digit). Please have a look at the attached sheet (YOY comparison.xlsx). As you can see, the Feb 2018 YoY drop has changed significantly. Kindly advise on this.
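A rough illustration of the effect described above, using made-up numbers rather than the figures from the attached sheet. The counts, the 1.6% relative error, and the helper name `yoy_change` are all assumptions for the sake of the sketch; it only shows how a small per-count error can be as large as the YoY change being measured.

```python
def yoy_change(current, previous):
    """Relative year-over-year change, e.g. -0.03 for a 3% drop."""
    return (current - previous) / previous

# Hypothetical exact client counts: a true 3% YoY drop.
exact_2017 = 1_000_000
exact_2018 = 970_000

# Assumed relative error of each approximate count (illustrative value only).
rel_err = 0.016

# Worst case: the errors on the two counts point in opposite directions.
low = yoy_change(exact_2018 * (1 - rel_err), exact_2017 * (1 + rel_err))
high = yoy_change(exact_2018 * (1 + rel_err), exact_2017 * (1 - rel_err))

print(f"exact YoY:        {yoy_change(exact_2018, exact_2017):+.2%}")
print(f"approx YoY range: {low:+.2%} to {high:+.2%}")
```

With these assumed inputs the reported change can range from roughly twice the true drop to a slight increase, which is the kind of distortion the comment describes.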
Flags: needinfo?(wlachance)
Comment 10•7 years ago
Assignee
Comment 11•7 years ago
(In reply to arana from comment #9)
> Hi William,
>
> approx_count_distinct does not work very well for year-over-year (YoY)
> analysis. The approximate client count is within an acceptable error of the
> actual count, but in a YoY analysis that small error throws off the YoY
> percentage increase/decrease, because the percentage changes themselves are
> small (single digit). Please have a look at the attached sheet (YOY
> comparison.xlsx). As you can see, the Feb 2018 YoY drop has changed
> significantly. Kindly advise on this.
I don't know if there's any reasonable alternative to using approx_count_distinct/HLL for this task. It's known to be imprecise, but getting an exact number is impossible given the memory requirements and the size of our dataset. Ryan, do you have any thoughts on this or alternative approaches? Is this even worth doing (I'm guessing a known imprecise number is still better than no number)?
Flags: needinfo?(wlachance) → needinfo?(rharter)
Comment 12•7 years ago
A few comments:
1. This can't be done until bug 1305087 is complete, since this is Python code.
2. The precision of HLLs is configurable by setting the number of bits. Currently we set it at 12 (this gives 2.3% error); we can increase it for this dataset, but we would need to test that the Presto HLL reader is configurable as well.
3. Since this is just another client_count dataset at the end of the day, if we fix bug 1405066 we could consider running this from the existing GenericCountView [0], and then not have to rely on bug 1305087.
[0] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/GenericCountView.scala
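For reference, the trade-off between bits and precision in point 2 can be sketched with the commonly cited HyperLogLog approximation, relative standard error ≈ 1.04 / sqrt(m) for m = 2^bits registers. The function name `hll_standard_error` is made up for this sketch, and the library actually in use may quote a different bound: under this approximation 12 bits works out to roughly 1.6%, so the 2.3% figure above may refer to a different measure or setting, which the thread does not confirm.

```python
import math

def hll_standard_error(bits):
    """Textbook HyperLogLog relative standard error for 2**bits registers."""
    return 1.04 / math.sqrt(2 ** bits)

# Each extra bit doubles the number of registers (and the sketch size)
# while shrinking the error by a factor of sqrt(2).
for bits in (11, 12, 14, 16):
    registers = 2 ** bits
    print(f"bits={bits:2d}  registers={registers:6d}  "
          f"std error ~ {hll_standard_error(bits):.2%}")
```

The practical consequence is that halving the error costs four times the registers, so precision suitable for single-digit YoY comparisons gets expensive quickly.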
Assignee
Comment 13•7 years ago
Frank, are you sure you're thinking of approx_count_distinct? The bug you're referencing is referring to a separate package called spark-hyperloglog, while approx_count_distinct is a standard part of pyspark:
http://spark.apache.org/docs/latest/api/python/_modules/pyspark/rdd.html#RDD.countApproxDistinct
Comment 14•7 years ago
(In reply to William Lachance (:wlach) (use needinfo!) from comment #13)
> Frank, are you sure you're thinking of approx_count_distinct? The bug you're
> referencing is referring to a separate package called spark-hyperloglog,
> while approx_count_distinct is a standard part of pyspark:
>
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/rdd.html#RDD.countApproxDistinct
So maybe approxDistinct will support your use-case, but the problem is it only gives one number for each dimension. When you try to combine dimensions, you lose information on how many clients there are, since a single client can show up in multiple dimensions.
If there is no need for aggregation in this dataset, then you can use the single numbers. Otherwise they aren't sufficient, and that's why we don't use them for client_count datasets.
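The double-counting problem described here can be shown with a toy example. The engine names and client ids are hypothetical, and plain Python sets stand in for whatever the real dataset stores; the point is only that per-dimension distinct counts cannot be summed.

```python
# Hypothetical rows: (search engine -> set of client ids seen for that engine).
rows = {
    "google": {"a", "b", "c"},
    "bing": {"c", "d"},
}

# Per-dimension distinct counts, as a single stored number per row would be.
counts = {engine: len(clients) for engine, clients in rows.items()}

# Summing the stored counts double-counts client "c", who used both engines.
summed = sum(counts.values())

# The true total needs the underlying sets (or a mergeable sketch of them).
true_total = len(set().union(*rows.values()))

print(summed, true_total)  # prints "5 4"
```

Once only the numbers 3 and 2 are stored, there is no way to recover 4; that information has to be carried in something that can be merged.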
Reporter
Comment 15•7 years ago
Frank appears to have covered most of this, but here's my two cents for the record.
(In reply to William Lachance (:wlach) (use needinfo!) from comment #11)
> Ryan, do you have any thoughts on this or alternative
> approaches? Is this even worth doing (I'm guessing a known imprecise number
> is still better than no number)?
Amit's correct in that HLL does have some disadvantages for trend lines and YoY analyses [1]. An exact count would be nice, but, as Frank notes, exact counts can't be aggregated. I don't think we'll be able to get these error bounds much tighter than a ~1.5% standard error. We won't be able to use this for YoY analysis, but the HLL will still be useful.
(In reply to Frank Bertsch [:frank] from comment #14)
> So maybe approxDistinct will support your use-case, but the problem is it
> only gives one number for each dimension. When you try to combine
> dimensions, you lose information on how many clients there are, since a
> single client can show up in multiple dimensions.
>
> If there is no need for aggregation in this dataset, then you can use the
> single numbers. Otherwise they aren't sufficient, and that's why we don't
> use them for client_count datasets.
We will definitely need aggregation. An HLL is the right tool for this job.
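As a minimal sketch of why an HLL supports aggregation: two sketches are merged by taking the register-wise maximum, and the merged sketch estimates the size of the union. `TinyHLL`, the client ids, and the search totals below are all hypothetical; this is not the spark-hyperloglog or Presto implementation, just the textbook algorithm in pure Python.

```python
import hashlib
import math

class TinyHLL:
    """Minimal HyperLogLog sketch, for illustration only."""

    def __init__(self, bits=12):
        self.bits = bits
        self.m = 1 << bits
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.bits)                # top bits pick a register
        rest = h & ((1 << (64 - self.bits)) - 1)   # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.bits) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union of two sketches = register-wise maximum.
        merged = TinyHLL(self.bits)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

# Hypothetical per-engine sketches with 5,000 overlapping clients.
google, bing = TinyHLL(), TinyHLL()
for i in range(20000):
    google.add(f"client-{i}")
for i in range(15000, 30000):
    bing.add(f"client-{i}")

searches = {"google": 90000, "bing": 30000}  # made-up search totals
clients = google.merge(bing).cardinality()   # estimates the 30,000 distinct clients
print(round(clients), sum(searches.values()) / clients)
```

The last line is the kind of client-adjusted metric the bug asks for: total searches divided by the estimated number of distinct clients across the merged rows, rather than by a double-counted sum.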
(In reply to Frank Bertsch [:frank] from comment #12)
> A few comments:
>
> 1. This can't be done until bug 1305087 is complete, since this is Python
> code.
Thanks for identifying Bug 1305087, setting that as a blocker.
> 3. Since this is just another client_count dataset at the end of the day, if
> we fix bug 1405066 we could consider running this from the existing
> GenericCountView [0], and then not have to rely on bug 1305087.
I'm fine with this, but porting the search_aggregates dataset to Scala seems like more of a burden than figuring out how to register the Scala function as a UDF.
[1] https://blog.harterrt.com/hll_trends.html
Depends on: 1305087
Flags: needinfo?(rharter)
Assignee
Comment 16•7 years ago
Apologies for the delay on this; I've been rather swamped with mission control work. Now that amiyaguchi has fixed bug 1305087, this should be unblocked. I will try to get this done early next week.
Comment 17•7 years ago
Assignee
Comment 18•7 years ago
I have a tentative solution for this, but am blocked on testing due to bug 1466936.
Depends on: 1466936
Assignee
Comment 19•7 years ago
This is now reviewed and ready to land along with some other changes here: https://github.com/mozilla/python_mozetl/pull/246
Thanks to Marina Samuel (:emtwo) for help testing this change.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•5 years ago
Component: Datasets: Search → Datasets: General
Updated•3 years ago
Component: Datasets: General → General