Collect number of page visits in captcha detection ping
Categories: Core :: Privacy: Anti-Tracking (task)

 | Tracking | Status
---|---|---
firefox136 | --- | fixed

People: Reporter: fkilic, Assigned: fkilic
Attachments: 2 files
We need a divisor for some of the metrics. Pages visited is a good candidate.
Comment 1 (Assignee) • 25 days ago
Comment 2 • 25 days ago
I'm not sure who to get review from on this patch, but I would like more eyes on it.
We have a separate ping that tracks the number of different types of captchas seen by users. As described in DO-1901, we want to perform the following analyses:
- Do certain privacy settings lead to higher incidence rates of CAPTCHAs, or harder CAPTCHAs (as inferred from higher rates of failures/more attempts)?
- Has a new Firefox feature or behavior (e.g. the introduction or extension of a privacy setting, or just a new web API being introduced) caused an increase in CAPTCHA rates?
To do these, we intend to divide the number of captchas seen by the number of documents visited, giving a general CAPTCHA rate for web browsing. Obviously, we need the number of documents visited to do this.
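For concreteness, the analysis boils down to the following standalone sketch (the struct and names here are hypothetical, not the actual Glean metrics):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical counters; the real metrics live in the captcha-detection ping.
struct CaptchaPingCounters {
  uint64_t captchasSeen = 0;
  uint64_t pagesVisited = 0;
};

// CAPTCHAs per page load: the "general CAPTCHA rate of web browsing".
double CaptchaRate(const CaptchaPingCounters& c) {
  // Guard against division by zero when no pages were recorded yet.
  return c.pagesVisited == 0
             ? 0.0
             : static_cast<double>(c.captchasSeen) / c.pagesVisited;
}

int main() {
  CaptchaPingCounters counters;
  counters.captchasSeen = 3;
  counters.pagesVisited = 1200;
  printf("CAPTCHA rate: %.5f per page load\n", CaptchaRate(counters));
  return 0;
}
```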
We already collect this, so why are we re-implementing it? For the same reason we had to make a separate ping in the first place: we want to know the CAPTCHA rate for different privacy settings, so if a user changes a privacy setting, we need to reset the metrics and start fresh. Therefore, when one of the relevant prefs changes, we submit the ping and reset the metrics. So we record the data in the same place, a second time, specifically for our ping.
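Schematically, the submit-and-reset flow looks like this (a standalone sketch under my reading of the patch; the class and method names are made up, and the real code goes through Gecko pref observers and Glean):

```cpp
#include <cstdint>

// Hypothetical stand-in for the ping's metrics.
struct CaptchaPingCounters {
  uint64_t captchasSeen = 0;
  uint64_t pagesVisited = 0;
};

class CaptchaDetectionPing {
 public:
  void OnPageLoad() { mCounters.pagesVisited++; }
  void OnCaptchaShown() { mCounters.captchasSeen++; }

  // Called when one of the relevant privacy prefs flips: flush the counts
  // gathered under the old configuration, then start fresh so the next
  // ping reflects only the new configuration.
  void OnRelevantPrefChanged() {
    Submit();
    mCounters = CaptchaPingCounters{};  // reset the metrics
  }

 private:
  // In the real code this would be a Glean ping submission.
  void Submit() {}

  CaptchaPingCounters mCounters;
};

int main() {
  CaptchaDetectionPing ping;
  ping.OnPageLoad();
  ping.OnCaptchaShown();
  ping.OnRelevantPrefChanged();  // submits, then counters start at zero
  return 0;
}
```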
Does this seem alright? I'm not thrilled about it, but we definitely can't use the number from the main ping: besides the preference-change situation, the data-collection timeframes of the two pings are unlikely to be identical.
Comment 3 (Assignee) • 25 days ago
For :chutten, just FYI regarding your heads-up about the use counters' criteria: I checked ShouldIncludeInTelemetry and it actually looks like a good filter. So I just added my metric next to the already existing page-load counting mechanism.
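Roughly, the hook point looks like the following standalone sketch. The types and the filter body are simplified stand-ins for the real Gecko code (the real ShouldIncludeInTelemetry has more conditions), and the counter names are hypothetical:

```cpp
#include <cstdint>

// Minimal stand-ins for illustration only; the real types are Gecko's.
struct Document {
  bool isAboutPage = false;
  bool isPrivateBrowsing = false;
};

// Stand-in for Gecko's use-counter eligibility check.
static bool ShouldIncludeInTelemetry(const Document& aDoc) {
  return !aDoc.isAboutPage;
}

static uint64_t gPagesVisited = 0;     // hypothetical ping counters
static uint64_t gPagesVisitedPbm = 0;

// Recorded alongside the existing page-load use-counter bookkeeping,
// reusing the same document filter.
void OnPageLoadCounted(const Document& aDoc) {
  if (!ShouldIncludeInTelemetry(aDoc)) {
    return;  // filtered out, just like the use counters
  }
  if (aDoc.isPrivateBrowsing) {
    gPagesVisitedPbm++;
  } else {
    gPagesVisited++;
  }
}

int main() {
  Document normal;  // counted
  Document pbm;
  pbm.isPrivateBrowsing = true;  // counted separately
  Document about;
  about.isAboutPage = true;  // filtered out
  OnPageLoadCounted(normal);
  OnPageLoadCounted(pbm);
  OnPageLoadCounted(about);
  return 0;
}
```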
Comment 4 • 25 days ago
So, I'm lacking a lot of context on bug 1913421 fwiw, but at a glance:
- Collecting things conditional on PBM sounds slightly sketchy to me, but alright.
- Since you're recording in the same place, it seems you could derive one of the counters? E.g., PBM pages == total pages - non-PBM pages, or so. But maybe there are complexities with the different pings that prevent that?
- It seems like the thing you want to do should more generally be part of the use counters; it does feel a bit sketchy to hook into that code only half-way. But fine, I guess. I'm not familiar with the data side of things, nor with how many different configurations you are tracking. Conceptually you could make these regular use counters, with one use counter per captcha provider and configuration you care about, right? Was that considered, by any chance?
Comment 5 • 24 days ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)
> - Collecting things conditional on PBM sounds slightly sketchy to me, but alright.
I was a little surprised too, but it turns out we collect telemetry from PBM normally, and we have a few places where we already do it conditionally on PBM. Based on our Nightly data, we have what sure looks like statistically strong evidence that behavior in PBM vs. non-PBM is different; that's not a strong justification on its own, but it's a small weight on the scale. This metric does record fairly aggressively how much PBM is used, though, not just that it was used. I'll speak with Luke and get his thoughts, and maybe chutten has thoughts from a data steward POV too.
> - Since you're recording in the same place, it seems you could derive one of the counters? E.g., PBM pages == total pages - non-PBM pages, or so. But maybe there are complexities with the different pings that prevent that?
The total_pages recorded there isn't an apples-to-apples comparison with the two metrics being added (because the pings can cover different timespans), so we can't derive one counter from the others across pings; we still need two of the three recorded in the same ping to derive the third.
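To make the timespan mismatch concrete, here are toy numbers (entirely made up) showing why subtracting across pings fails:

```cpp
#include <cstdint>
#include <cstdio>

// Toy numbers showing why the third counter can't be derived across pings
// whose collection windows differ.
int main() {
  // Main ping window: say, a full day of browsing.
  uint64_t mainPingTotalPages = 1000;

  // Captcha ping window: reset mid-day when a privacy pref changed, so it
  // only covers part of the day.
  uint64_t captchaPingNonPbmPages = 300;
  uint64_t captchaPingPbmPages = 50;  // the value we actually want

  // Naive cross-ping derivation is wildly off (700 vs. the true 50):
  uint64_t derivedPbm = mainPingTotalPages - captchaPingNonPbmPages;
  printf("derived PBM pages: %llu, actual: %llu\n",
         (unsigned long long)derivedPbm,
         (unsigned long long)captchaPingPbmPages);

  // Derivation only works when all counters share the same window, i.e.
  // when two of the three are recorded inside the captcha ping itself.
  return 0;
}
```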
> - It seems like the thing you want to do should more generally be part of the use counters; it does feel a bit sketchy to hook into that code only half-way. But fine, I guess. I'm not familiar with the data side of things, nor with how many different configurations you are tracking. Conceptually you could make these regular use counters, with one use counter per captcha provider and configuration you care about, right? Was that considered, by any chance?
We had not considered UseCounters. IIUC, the advantage of UseCounters is that they derive the percentages of page/document loads automagically?
We would need [Number of CAPTCHA Providers] x [Number of Settings] UseCounters... That would be 19 captcha metrics x 7 settings + 19 captcha metrics in PBM x ~7 settings = 266 new UseCounters? That seems like an awful lot...
Also, maybe this would not work. The idea would be that you increment the UseCounter only for the setting you currently have, but when the UseCounter data is submitted, it presumably submits Total [Occurrences, Document/Page Loads]. In the backend we'll have a bunch of data from users like {[0, 500], [1, 500]}, {[0, 100], [0, 100]}. In the first user's data we know they had the second setting enabled, but the first UseCounter is supposed to be ignored when averaging all users' data, and it won't be. And in the second user's data we don't even know which setting was enabled, so we can't tell which counter should be ignored. So I'm not sure how things work in the backend, but I think we would not be able to use UseCounters in the default way; we'd need to adapt the data before consuming it, and at that point we're pretty close to what we're doing already.
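A toy model of that backend problem (the numbers and the naive aggregation are invented for illustration; I don't know what the real pipeline does):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One use counter per setting; entries are {occurrences, page loads}.
struct CounterSample {
  uint64_t occurrences;
  uint64_t pageLoads;
};

int main() {
  // Two hypothetical users, two settings (counter index 0 and 1).
  // User A had setting 2 active, so their counter[0] is inert noise that
  // should be ignored; user B's active setting can't be determined at all.
  std::vector<std::vector<CounterSample>> users = {
      {{0, 500}, {1, 500}},  // user A
      {{0, 100}, {0, 100}},  // user B
  };

  // A naive backend average includes every sample, so user A's inert
  // counter[0] and user B's unattributable zeros dilute the rates:
  // setting 1's true rate for user A is 1/500, but the pooled estimate
  // comes out as 1/600.
  for (size_t setting = 0; setting < 2; ++setting) {
    uint64_t occ = 0, loads = 0;
    for (const auto& u : users) {
      occ += u[setting].occurrences;
      loads += u[setting].pageLoads;
    }
    printf("setting %zu: naive rate = %f\n", setting,
           loads ? static_cast<double>(occ) / loads : 0.0);
  }
  return 0;
}
```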
Comment 6 • 23 days ago
PBM use, and usage contextualized by whether it happened under PBM, has actually been unsketchy for quite a few years, so no worries there. I'm also a big fan of more, more-specific, more-narrow collections of similar things, which are less likely to a) be changed by one of the several parties interested in them, b) accidentally include too much or too little because of competing requirements, or c) be named something so generic that analysts think they can also use it for Thing X when it definitely isn't to be used for Thing X.
(In reply to Fatih Kilic [:fkilic] from comment #3)
> For :chutten, just FYI regarding your heads-up about the use counters' criteria: I checked ShouldIncludeInTelemetry and it actually looks like a good filter. So I just added my metric next to the already existing page-load counting mechanism.
As an example of this ("accidentally including too much or too little because of competing requirements") in practice: ShouldIncludeInTelemetry returns true for about:preferences. Might be a bug, might not be.
Comment 7 (Assignee) • 10 days ago
Comment 9 • 10 days ago
Backed out for landing the wrong patch version
Backout link: https://hg.mozilla.org/integration/autoland/rev/2ea0e69d7c1839c5e3a0d244985fc41a220643bf
Comment 10 • 10 days ago
Comment 11 • 10 days ago
Comment 12 • 9 days ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/a726284e1c88
https://hg.mozilla.org/mozilla-central/rev/9118e2f04aa3