Closed
Bug 1271310
Opened 9 years ago
Closed 9 years ago
Measure the number of unique domains visited in a "session fragment"
Categories
(Toolkit :: Telemetry, defect, P3)
RESOLVED
DUPLICATE
of bug 1271313
 | Tracking | Status |
---|---|---|
firefox49 | --- | affected |
People
(Reporter: Dexter, Unassigned)
References
Details
(Whiteboard: [measurement:client])
This bug is about measuring the number of unique domains visited (unique TLDs?) over the session fragment. CDN subdomain variations aren't interesting here.
We'll want to utilize the Places code for this; we already have some code like PLACES_PAGES_COUNT for guidance.
Context: https://docs.google.com/spreadsheets/d/1G33YEBL2-hSaF0-HrHPi0HVwANoTpYHsnevmMd1eZ1U/edit#gid=0
Comment 1•9 years ago
Does this include all loads or just toplevel loads? e.g. cheezburger.com loads an iframe from youtube.com, does that count as 1 or 2 unique domains?
Also specify: does this include private browsing or not?
How accurate does this need to be? I very much doubt that you want to use places for this. The places databases are not meant for this purpose and could be very inefficient for this task, and will be very perf-sensitive during shutdown. If we could use an in-memory hyperloglog, that could mitigate the effect of keeping thousands of records in memory.
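For readers unfamiliar with the structure mentioned above, here is a minimal HyperLogLog sketch in Python. It is an illustration only, not Firefox code: the class name, the choice of SHA-1 as the hash, and the default of 2^12 registers are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                       # number of index bits (assumed default)
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an assumption).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                     # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for host in ["cheezburger.com", "youtube.com", "cheezburger.com"]:
    hll.add(host)
print(round(hll.estimate()))
```

Memory use is fixed at one small integer per register regardless of how many domains are added, which is the property that makes it attractive compared with keeping every visited domain in memory.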
Comment 2•9 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #1)
> I very much doubt that you want to use
> places for this. The places databases are not meant for this purpose and
> could be very inefficient for this task, and will be very perf-sensitive
> during shutdown.
This is for potential guidance of how to catch the relevant events etc., nothing else.
Comment 3•9 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #1)
> Does this include all loads or just toplevel loads? e.g. cheezburger.com
> loads an iframe from youtube.com, does that count as 1 or 2 unique domains?
>
> Also specify: does this include private browsing or not?
>
> How accurate does this need to be?
Flags: needinfo?(bcolloran)
Rebecca, please double-check my thinking on all of this--
> > Does this include all loads or just toplevel loads? e.g. cheezburger.com
> > loads an iframe from youtube.com, does that count as 1 or 2 unique domains?
I think just top-level. (If a person visits a page, and that page loads content from a third party, I don't think the third-party load says that much about the user's choice or intent.)
> > Also specify: does this include private browsing or not?
I'm not sure. We want to know how many domains people have visited, and I think we'd be *interested* in including private browsing in that, but obviously we want to be super extra careful about anything having to do with private browsing. I don't know what is currently the standard, but I think we can live without including private browsing at this time, and
> > How accurate does this need to be?
I obviously don't really know how it works, but from what I've gathered places DB does sound scary, so avoiding it by using a probabilistic data structure would be fine for this. I don't care about the details of the data structure you might want to use, but just to put some numbers on it, I would target the following benchmarks:
(a) estimates should be unbiased and the distribution of estimates should be symmetric around the true value (equal probability of being above or below the true value, and not skewed in any complicated ways above or below the true value)
(b) with 95% probability, the estimate should be within 3% of the true value.
I think (a) is probably a requirement, but (b) I just made up. Rweiss sound about right?? ¯\_(ツ)_/¯
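To put benchmark (b) in concrete terms, the sketch below translates "within 3% with 95% probability" into a register count, assuming the structure is a HyperLogLog as suggested in comment 1. It relies on two assumptions: the commonly cited HyperLogLog standard error of about 1.04/sqrt(m) for m registers, and an approximately normal error distribution (so 95% corresponds to about 1.96 standard errors). The function name is hypothetical.

```python
import math


def registers_needed(rel_error, z=1.96):
    """Smallest power-of-two register count m with z * 1.04 / sqrt(m) <= rel_error."""
    sigma_target = rel_error / z             # allowed standard error
    m_min = (1.04 / sigma_target) ** 2       # solve 1.04 / sqrt(m) = sigma_target
    return 1 << math.ceil(math.log2(m_min))  # round up to a power of two


m = registers_needed(0.03)
print(m, "registers, standard error about", round(1.04 / math.sqrt(m), 4))
```

Under these assumptions the target needs on the order of a few thousand registers, so the memory cost is a few kilobytes per sketch.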
Flags: needinfo?(bcolloran) → needinfo?(rweiss)
oops, saw a squirrel mid-sentence--
> I don't know what is currently the standard, but I think
> we can live without including private browsing at this time, and
... I'm not sure it's important enough to go through a bigger policy review.
Comment 6•9 years ago
Agree with the above, with some additional positions:
1) Private browsing is private; I assert that we're seeking to describe typical non-private browsing intention with these measurements.
2) (a) is indeed a requirement and (b) is sufficient for now. We don't know what the tolerance parameter *should* be for such estimates, and 3% is probably reasonable enough a place to start. If all of the action is within that region, then we will need to get more precise, but without prior knowledge I don't feel strongly about a margin at this time.
Flags: needinfo?(rweiss)
Comment 7•9 years ago
Hello All, I want to work on this bug, Can I assign this to myself and start working on it?
Comment 8•9 years ago
(In reply to annakoppad from comment #7)
> Hello All, I want to work on this bug, Can I assign this to myself and start
> working on it?
We already scheduled this work. I recommend looking for a mentored bug following these sites:
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Introduction#Step_2_-_Find_something_to_work_on
http://www.joshmatthews.net/bugsahoy/
Comment 9•9 years ago
I was curious what numbers we are actually optimizing for here.
We currently don't have any "# of unique domains visited" measure, but we have PLACES_PAGES_COUNT which measures unique URIs in its store.
That probe is unreliable for per-session/-subsession analysis as it is only submitted on "idle-daily". However, it can give an upper bound on the unique URIs visited per session & user, which in turn gives us a (very high) upper bound on the unique domains visited.
I'm not too sure if the query is sane, but I got the following percentiles for the "per user deltas in places_pages_counts":
percentile, value
0.1, 6
0.25, 18
0.5, 57
0.75, 158
0.9, 511
0.95, 3007
0.99, 32767
https://sql.telemetry.mozilla.org/queries/370/source#table
Comment 10•9 years ago
Georg, is that a question for me and Rebecca? If so, I'm not sure what you're asking exactly when you mention the numbers we're optimizing for. Can you clarify?
Comment 11•9 years ago
Comment 12•9 years ago
Talking this over with Benjamin, the question came up on what level of detail we actually need here.
Most users will probably visit only a few unique domains, while a few outliers will visit many.
Is it enough information here to count unique domains up to N (say 100)?
And throw the rest of them into an "over N unique domains visited per subsession" bucket?
Flags: needinfo?(bcolloran)
Comment 13•9 years ago
Does it make the implementation a lot trickier to just report the actual numbers for everyone? What is the advantage to truncating the data? Unless there is a concrete advantage I think it will be best to report the raw data. Agreed that many users will only visit a few domains, but long tails can be interesting.
Flags: needinfo?(bcolloran)
Comment 14•9 years ago
There are costs for recording arbitrary numbers (in a probabilistic data structure); for the numbers of most users this seems like overkill.
The simplest thing would be to just record exact numbers for "unique domains visited" up to an upper bound (say 100 or 1000).
Over a certain limit we'd want to use a probabilistic data structure, but that makes things a little more complicated.
Unless there is a good reason for this and concrete questions to answer, i'd prefer not to.
If there is a good reason, i imagine we would be interested in bucketed values instead (say 1k to 10k, 10k to 50k, ...).
Flags: needinfo?(bcolloran)
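The "exact count up to an upper bound" idea described above could look roughly like the following sketch. The class and method names are hypothetical and the limit of 100 is just one of the values floated in this thread; this is not the implementation that was eventually landed.

```python
class CappedDomainCounter:
    """Counts unique domains exactly, up to a fixed limit (illustrative only)."""

    def __init__(self, limit=100):
        self.limit = limit
        self._seen = set()
        self.overflowed = False

    def record(self, domain):
        if domain in self._seen:
            return                      # already counted
        if len(self._seen) >= self.limit:
            self.overflowed = True      # beyond the cap: remember only that fact
            return
        self._seen.add(domain)

    def report(self):
        # Exact count below the cap, a single "over N" bucket above it.
        return f"over {self.limit}" if self.overflowed else str(len(self._seen))


counter = CappedDomainCounter(limit=3)
for d in ["a.com", "b.com", "a.com", "c.com"]:
    counter.record(d)
print(counter.report())
```

Memory is bounded by the limit, and the share of subsessions reporting the overflow bucket indicates whether the limit was chosen too low.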
Comment 15•9 years ago
Gotcha. Sorry, I'd forgotten that we were talking about a probabilistic data structure up-thread for this measure.
In that case, I agree with you that setting a fixed N is acceptable. I think N=1000 feels pretty safe to start, but if that would use too much memory or something, we could try N=100. In either case, if we end up truncating too aggressively, we'd need to revisit this and bump up the N. If we start collecting data and find that more than 1% of subsessions end up in the "over N unique domains visited per subsession" bucket, we'll need to increase N.
Does that sound ok?
Flags: needinfo?(bcolloran)
Comment 16•9 years ago
That sounds good to me, thanks.
Reporter | ||
Comment 17•9 years ago
Should we consider mail.foo.com and bar.foo.com two different domains?
Flags: needinfo?(bcolloran)
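The two possible answers to the question above give different counts, as the sketch below illustrates. It is deliberately naive: the suffix set is a tiny hypothetical stand-in for the Public Suffix List, IP addresses are not handled, and the helper names are made up for this example.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny stand-in for the Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au"}


def host_of(url):
    """Lower-cased host name of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def base_domain(host):
    """Naive registrable domain: the public suffix plus one more label."""
    labels = host.split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


urls = [
    "https://mail.foo.com/inbox",
    "https://bar.foo.com/",
    "https://www.example.co.uk/",
]
hosts = {host_of(u) for u in urls}
bases = {base_domain(h) for h in hosts}
print(len(hosts), "unique hosts,", len(bases), "unique base domains")
```

Counting by base domain folds mail.foo.com and bar.foo.com together, which also matches the description's note that CDN subdomain variations are not interesting.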
Reporter | ||
Updated•9 years ago
Flags: needinfo?(rweiss)
Reporter | ||
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(rweiss)
Flags: needinfo?(bcolloran)
Resolution: --- → DUPLICATE