Closed Bug 1941596 Opened 25 days ago Closed 9 days ago

Collect number of page visits in captcha detection ping

Categories

(Core :: Privacy: Anti-Tracking, task)

task

Tracking

()

RESOLVED FIXED
136 Branch
Tracking Status
firefox136 --- fixed

People

(Reporter: fkilic, Assigned: fkilic)

References

Details

Attachments

(2 files)

We need a divisor for some of the metrics. Pages visited is a good candidate.

I'm not sure who to get review from on this patch, but I would like more eyes on it.

We have a separate ping that tracks the number of different types of captchas seen by users. As described in DO-1901 we want to do the following analyses:

  1. Do certain privacy settings provide greater incident rates of CAPTCHAs, or harder CAPTCHAs (as inferred by higher rates of failures/more attempts.)

  2. Has a new Firefox feature or behavior (e.g. an introduction or extension of a privacy setting, or just a new web API being introduced) caused an increase in CAPTCHA rates.

To do these we intend to divide the number of captchas seen by the number of documents visited to get a general CAPTCHA rate of web browsing. Obviously we need the number of documents visited to do this.

We already collect this, so why are we re-implementing it? The same reason we had to make a separate ping in the first place - we want to know the CAPTCHA rate for different privacy settings, so if a user changes their privacy setting, we need to reset the metrics and start fresh. Therefore when one of the relevant prefs changes, we submit the ping and reset the metrics. So we record the data in the same place, a second time, specifically for our ping.

Does this seem alright? I'm not thrilled about it, but we definitely can't use the number from the main ping, because besides the preference change situation, the data collection timeframe for the two pings is not likely to be identical.

Depends on: 1913421
Flags: needinfo?(emilio)
Flags: needinfo?(chutten)

For :chutten: just fyi, regarding your heads up about use counter's criteria, I checked ShouldIncludeInTelemetry and it actually looks like a good filter. So I just added my metric addition next to the already existing page load counting mechanism.

So, I'm lacking a lot of context on bug 1913421 fwiw, but at a glance:

  • Collecting things conditional on PBM sounds slightly sketchy to me, but alright.
  • Since you're using the same place, it seems you could derive one of the counters? e.g., pbm pages == total pages - non-pbm pages, or so. But maybe there are complexities about the different pings why that couldn't be done?
  • It seems like the you want to do should more generally be part of the use counters, it does feel a bit sketchy to hook into that code only half-way. But fine I guess. I'm not familiar with the data side of things, and with how many different configurations are you tracking. Conceptually you could make this regular counters with one use counter per captcha provider and configuration you care about, right? Was that considered by any chance?
Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

  • Collecting things conditional on PBM sounds slightly sketchy to me, but alright.

I was a little surprised also, but it turns out that we collect telemetry from PBM normally, we have a few places we do it conditionally on pbm. Based on our Nightly data we have what sure looks like statistically strong evidence that the behavior in PBM vs non-PBM is different which is not a strong justification but is a small weight on the scale. This metric does record pretty aggressively how much PBM is used though - not just that it was used. I'll speak with Luke and get his thoughts and maybe chutten has thoughts from a data steward POV also.

  • Since you're using the same place, it seems you could derive one of the counters? e.g., pbm pages == total pages - non-pbm pages, or so. But maybe there are complexities about the different pings why that couldn't be done?

The total_pages being recorded here isn't an apples-to-apples comparison with the two metrics being added (because the pings can cover different timespans) so we can't derive, we still need two of the three to derive the third.

  • It seems like the you want to do should more generally be part of the use counters, it does feel a bit sketchy to hook into that code only half-way. But fine I guess. I'm not familiar with the data side of things, and with how many different configurations are you tracking. Conceptually you could make this regular counters with one use counter per captcha provider and configuration you care about, right? Was that considered by any chance?

We had not considered UseCounters. IIUC the advantage of UseCounters is it derives the percentages for Page/Document loads automagically?

We would need [Number of CAPTCHa Providers] x [Number of Settings] Use Counters... That would be 19 captcha metrics x 7 settings + 19 captcha metrics in pbm x ~7 settings = 266 new UseCounters? That seems like an awful lot...

Also maybe this would not work. The idea would be you increment the UseCounter only on the setting you currently have - but when the UseCounter data is submitted it presumably submits Total [OCcures, Document/Page Loads]. In the backend we'll have a bunch of data from users like {[0, 500], [1, 500]}, {[0, 100], [0, 100]} - in the first user's data We know they had the second setting enabled, but the first UseCounter is supposed to be ignored when avergaing all users data - but it won't be. And in the second user's data we don't even know which setting is enabled and which should be ignored or not... So I'm not sure how things work in the backend but I think we would not be able to use UseCounters in the default way and we'd need to adapt the data before consuming it and at that point we're pretty close to what we're doing already.

PBM use, and usage contextualized by whether it happened under PBM has actually been unsketchy for quite a few years, so no worries there. I'm also a big fan of more, more-specific, more-narrow collections of similar things that are less likely to a) be changed by one of the several parties interested in it, b) accidentally including too much or too little because of competing requirements, c) named something too generic in a way that makes analysts think they can also use it for Thing X when it definitely isn't to be used for Thing X.

(In reply to Fatih Kilic [:fkilic] from comment #3)

For :chutten: just fyi, regarding your heads up about use counter's criteria, I checked ShouldIncludeInTelemetry and it actually looks like a good filter. So I just added my metric addition next to the already existing page load counting mechanism.

As an example of this ("accidentally including too much or too little because of competing requirements") in practice, ShouldIncludeInTelemetry returns true for about:preferences. Might be a bug, might not be.

Flags: needinfo?(chutten)
Pushed by tritter@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8a3aed51c9e1 Collect number of page visits in captcha detection ping. r=tjr https://hg.mozilla.org/integration/autoland/rev/ad53b6d0d168 Re-enable captcha detection in non-nightly builds. r=tjr

Backed out for landing the wrong patch version

Backout link: https://hg.mozilla.org/integration/autoland/rev/2ea0e69d7c1839c5e3a0d244985fc41a220643bf

Backout by smolnar@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/30c7989462b5 Backed out 2 changesets for landing the wrong patch version.
Pushed by fkilic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a726284e1c88 Collect number of page visits in captcha detection ping. r=tjr https://hg.mozilla.org/integration/autoland/rev/9118e2f04aa3 Re-enable captcha detection in non-nightly builds. r=tjr
Status: ASSIGNED → RESOLVED
Closed: 9 days ago
Resolution: --- → FIXED
Target Milestone: --- → 136 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: