Closed Bug 1261467 Opened 8 years ago Closed 8 years ago

Privacy review on domain name collection

Categories

(Hello (Loop) :: Client, defect)


Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: ianbicking, Unassigned)

References

Details

Attachments

(1 file)

For Hello we would like to understand what kind of tasks our users find Hello useful for.  I have attached for review a link to a document that describes the implementation and reasoning for a measurement: https://github.com/mozilla/loop/blob/493e71adf4d51ec23ff379c163255197ff1a3341/docs/DataCollection.md#domain-logging
Attachment #8737327 - Flags: review?(benjamin)
Depends on: 1211542
https://support.mozilla.org/en-US/kb/share-telemetry-data-mozilla-help-improve-firefox?redirectlocale=en-US&redirectslug=send-performance-data-improve-firefox indicates the following behavior.

"... Telemetry measures and collects non-personal information, such as memory consumption, responsiveness timing and feature usage. It then sends this information to Mozilla on a daily basis and we use it to make Firefox better for you. "

For decades privacy advocates and browser development organizations have treated browsing history as personal information.  Adding browsing-history tracking to telemetry for the purposes stated goes against this historical precedent, and if this is implemented it should require some clear changes to several of our privacy statements, since we are no longer limiting our collection to feature usage within the browser, but extending it to the sites users view on the web.

Some additional concerns were raised in the dev.platform discussion of this feature and should be considered as part of the privacy review.

See: https://groups.google.com/forum/#!topic/mozilla.dev.platform/nyVkCx-_sFw

Namely:

Can alternate methods be used to understand how people use the web, including surveys and research studies?

What is the thesis about Hello users' use patterns that we are attempting to verify or dismiss?

What will actually be done with the data once collected, and how might it influence product UI/UX?

Can the UI/UX changes be developed, tested, and evaluated independently and/or instead of this data collection as a better way to get direct user feedback on the planned changes to the hello feature?
https://en.wikipedia.org/wiki/Web_browsing_history has a good summary of user expectations around browsing history.  

One additional thing mentioned there is the concept of users having control over their history of sites visited.  If we collect browsing history and store it on Mozilla servers for 6 months (as originally planned, or less -- being considered), users still lose control over where their history is stored and when it is removed.

This has the potential of opening up liability and requiring cooperation with law enforcement and government agencies in the discovery of users who might come under investigation.  We should be cautious about gathering such data and being required to participate in these kinds of activities.
Depends on: 1262832
the attachment on the bug is actually just a link to https://github.com/mozilla/loop/blob/493e71adf4d51ec23ff379c163255197ff1a3341/docs/DataCollection.md#domain-logging

I'd ask these questions from that doc to get a better understanding of the effectiveness of this approach vs. other possible approaches where we would just survey or study users directly to get at why they use Hello, in what circumstances they use it, and what they try to accomplish when using it.

re: Given some knowledge of how people use Hello we can improve the Hello experience given the specific tasks 

   In what ways have we thought about or designed possible changes/improvements to the experience?  
   Is there any documentation available for those different experiences?
   Do we have any indication/data from user testing that these improvements will work for users and improve the experience?

re: we can infer from the sites people share with Hello.

   How will this inference work?
   and what possible problems might exist with false positives and misdirection on the inference?  

   For example, we will get back many reports about shared visits to Facebook.  e.g. logs will show only 
   facebook.com

   How will we get even high-level inference of whether people are planning events, sharing photos, playing games, engaging in commerce, showing where other friends are around the world on a map, or assisting others with general help?  More detailed information would be required to get this kind of understanding, such as
   https://www.facebook.com/
   https://www.facebook.com/farmville-two/
   https://www.facebook.com/events/upcoming?action_history=null
   https://www.facebook.com/salegroups/?group_sell_ref=landing_page_bookmark&view=landing_page_your_groups
   https://www.facebook.com/livemap/?ref=bookmarks
   
   Most popular websites do more than one thing, so to understand what users are doing on a site we would need detailed URLs with directory-path information, and a deconstruction of how the site works.  Neither is planned.  Even Google, which offers mostly search results, has a variety of news feeds and offerings under the google.com domain.
   https://google.com/
   https://google.com/news
   https://google.com/finance
   https://google.com/about/careers/
   https://google.com/press/
   https://google.com/press/blog-directory.html
   https://google.com/permissions/
   
Trying to understand the reasons behind particular visits to any particular domain is a complex problem, which will be magnified across the 2000-3000 sites we intend to collect data on, many of which will not be in the language of the product managers and engineers who will look at the data.

re:  We report only on visited domains that are part of a whitelist that is shipped with Hello. 

   why is a whitelist approach chosen?   
   what kind of bias will this introduce into understanding of how people are using hello?
   what kind of gaps or blind spots will the white list leave in our understanding of how users might be using hello?

re: This whitelist is constructed from the top 2000 visited domains, plus some additions of the top domains in some areas of interest (home, travel, shopping) when those domains don't themselves show up in the top 2000. 

    63% of hello users are outside the US and/or don't speak English.
    https://docs.google.com/spreadsheets/d/1R8_m7FYB9A24JRk-D2xqbpA0IMnYbZat609gRFp6mxY/edit#gid=1517829759
    what effort has been made to ensure the list is representative of international use patterns,
    and how do we intend to understand and assess URLs that might be shared by non-English-speaking users?

    If we detect a problem with any particular whitelist entry for which we would rather not collect data, or we find a domain that we want to collect data for how is the whitelist updated?
Also, I think it's commonly agreed that Hello usage is limited and largely confined to early adopters.  If the goal is to understand use cases that might bring Hello to mainstream users, how will this data collection help in achieving that goal?  It seems unlikely to help, and other survey and observation methods would be more helpful in both understanding the data we are getting and who it's coming from.

Seems like working on bugs like "Bug 1262614 - add a way for users to provide targeted feedback for hello features, and better ways to report out this feedback" is a better first step in gaining a better understanding of Hello users, what they are doing with Hello, and what obstacles are in their path.
The spec mentions that some domains related to home, travel, and shopping were added to the list in the interest of understanding sharing of URLs related to those kinds of activities.

Has anyone reviewed and classified the other 3000 URLs listed in
https://github.com/mozilla/loop/blob/493e71adf4d51ec23ff379c163255197ff1a3341/add-on/chrome/modules/DomainWhitelist.jsm

to understand the breadth and depth of classifications of browsing history that we might be gathering data on, and to confirm that categories intended to be filtered, such as porn, are actually absent from the list?

gotporn.com, upornia.com, and mundosexanuncio.com seem like a few obvious sites to check and remove.  Others will be harder to detect if the domain is in an unfamiliar language or an intentionally obscure domain name is used.

Have we removed domains where the bulk of the content is behind a login, where we really don't know what kind of content is served or how we might classify it even if we got reports of users sharing a URL on a domain like http://www.nowvideo.sx/ ?

A quick glance shows that we could be collecting data on possibly sensitive browsing history, such as interest in:

 healthcare-related sites - medicare.gov, hopkinsmedicine.org, mayoclinic.org, health.com, healthboards.com, healthcare.gov, healthgrades.com, healthline.com, healthstream.com, heart.org, familydoctor.com.cn, familydoctor.org, vitals.com

interest in possible legal and illegal drug use - drugabuse.gov, drugs.com, drugstore.com

Has anyone scanned the proposed whitelist to remove sites of little value, such as a variety of ad networks, sites that appear to do nothing ( http://exhentai.org/ , http://www.gougou.com/ etc.), 404's and such?  The value of the smallest list possible is that it will make classification and categorization easier to establish and maintain for checks like the ones suggested above, and help us understand the bias brought about by listing some sites and types of sites and not others.  Especially if we need to try and deconstruct the layout of the site and the content served to understand what the users might be doing on the site, as mentioned in comment 3.

In general, the list seems dominated by eCommerce.  If we primarily ask users to report domains where they might be shopping and sharing, that will give a tilted and biased view that that is what they do with Hello.
Can we determine classification of the sites in the list before we start shipping it to understand this possible bias?
> Can alternate methods be used to understand how people use the web, including surveys and research studies?

We have done surveys and studies of potential users for Hello, but this data is often optimistic, the people surveyed may be biased by how the survey was being presented, and of course the value people expect is different from the value people find.

> What is the thesis about Hello users' use patterns that we are attempting to verify or dismiss?

We would like to gain insight about the value people are finding in the tool, rather than a binary yes/no verification.  Some things we could detect from this data:

1. It's all Mozillians using the tool, doing Mozilla things on Mozilla sites.
2. People find it particularly useful for travel, or shopping, or some category along those lines.
3. People find it particularly useful for looking at reviews with each other (technically "shopping", but a different kind of categorization)
4. It's all enterprisey sites.
5. We see that what appears to be a community of users has found the tool, if that community of users is also associated with a whitelisted site.
6. We see some local sites develop popularity (many sites are local on the whitelist), and there is a regional interest or word-of-mouth growth happening.  These local sites might also be defined in a way we currently do not understand (since the categories we construct are a byproduct of our cultural lens).
7. There's no patterns at all.

> What will actually be done with the data once collected, and how might it influence product UI/UX?

First we will be looking for patterns to develop.  The product experience in Hello has changed, and we don't expect people to immediately develop useful patterns for how to get value from Hello.

If we see a use case emerging as popular we might do more qualitative studies to see how to better address those use cases.

Implied uses for Hello give us some basis for how we construct our user stories and the examples we use to guide our design.  The examples we use when we imagine and develop features have a lot of power to hide or reveal flaws and opportunities in the designs.  Our examples now are aspirational, not empirical.

If we see people finding value in certain areas we could help guide our product experience to introduce more people to those use cases, through the FTU or marketing messages.

But because we aren't correlating this data with any other data we are somewhat limited.

> Can the UI/UX changes be developed, tested, and evaluated independently and/or instead of this data collection as a better way to get direct user feedback on the planned changes to the hello feature?

We do this testing currently, but we make many assumptions in the process.  This is one way to confirm some of those assumptions.  For instance, we have done user testing, giving people tasks to perform with Hello.  This has been interesting, but WE have made up the tasks, and so we aren't thinking about tasks that other people have organically found to be useful.

> If we collect browsing history and store it on Mozilla servers for 6 months (as originally planned, or less -- being considered), users still lose control over where their history is stored and when it is removed.

We only store the data in a way that aggregates user activity.  So for each of the domains on the whitelist, you will see a count of how many sessions have included that domain.  There's no individual history represented in the data.
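
A minimal sketch of that aggregation, in Python (the names here are hypothetical; this is not the actual server code, just the shape of the data described above):

    from collections import Counter

    # Each submitted report is treated as just the set of whitelisted
    # domains seen in one session; the server keeps only per-domain
    # session counts, never a per-user history.
    session_counts = Counter()

    def record_session(domains_in_session):
        # Deduplicate so one session increments each domain at most once.
        for domain in set(domains_in_session):
            session_counts[domain] += 1

    record_session(["google.com", "ebay.com", "google.com"])
    record_session(["ebay.com"])
    # session_counts: Counter({'ebay.com': 2, 'google.com': 1})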

> In what ways have we thought about or designed possible changes/improvements to the experience?  

For instance, a broad distinction in how people work with Hello: is it being used to inform other people about things on the web, to work with another person collaboratively, or to make and project specific observations?  I think we can get some empirical insight into this via this collection.

If it's about informing another person, then the record of visited URLs might be most useful and should be highlighted.  If it's about collaboration, then we might want to prioritize collaborative interaction or handoff techniques.  If it's about making observations, then we might want ways to draw attention more powerfully.

> Is there any documentation available for those different experiences?

There is, but I'm not sure what is public as part of our partnership.  We have a fairly long history of speculative proposed experiences in Hello.

> Do we have any indication/data from user testing that these improvements will work for users and improve the experience?

Our partner has done market analysis which shows broad potential for value around decision making and collection of resources for discussion.  Our marketing department has also done studies in this area, but again I'm afraid I'm not sure what is public (most of it is hosted on Google Drive).  But these are studies of opportunity, and are not verified against a particular implementation.

> re: we can infer from the sites people share with Hello.
> How will this inference work?

We'll have to make judgements given the results of the data collection.

> and what possible problems might exist with false positives and misdirection on the inference?  

Domains are incomplete knowledge to be sure.

We don't have ways to separate out long (valuable) sessions from short ones, successful sessions from failed ones, users who have many sessions from users who have a smaller number of sessions.  We are throwing out a lot of information in this collection, in an effort to maintain anonymity of sensitive data.

> For example, we will get back many reports about shared visits to Facebook.  e.g. logs will show only facebook.com

Yes, that is a limit.  We don't have something for paths that is equivalent to the ranked list of domains, to provide a source.

> re:  We report only on visited domains that are part of a whitelist that is shipped with Hello. 
> why is a whitelist approach chosen?   

Domains can be sensitive.  The whitelist includes popular domains that we don't believe can be traced back to any user or group.

> what kind of bias will this introduce into understanding of how people are using hello?

We will be biased towards broadly applicable use cases, rather than high-value niche use cases.  We may miss non-obvious uses underneath domains as well (such as google.com/news).

> what kind of gaps or blind spots will the white list leave in our understanding of how users might be using hello?

We can only infer the value users are getting out of Hello given domains; it doesn't give us direct evidence.

> what effort has been made to ensure the list is representative of international use patterns, and how do we intend to understand and assess URLs that might be shared by non-English-speaking users?

We used the international top list of sites, not a country-specific one.  We haven't identified any locales of interest for Hello, if we did then it would probably make sense to augment this list further for those locales.

In my experience it's not very hard to get a basic understanding of many of these sites even as a foreigner, but if we see specific interest it gives us clear questions to follow up on.

> If we detect a problem with any particular whitelist entry for which we would rather not collect data, or we find a domain that we want to collect data for how is the whitelist updated?

It would require an update to the add-on.  Note that while we try to remove adult sites, I don't believe that we are compromising privacy even with the inclusion of those sites.

> if the goal is to understand use cases that might bring hello to mainstream users and use cases how will this data collection help in achieving that goal?

I think we'll need to watch this data develop; it is possible the patterns never develop in such a way that we get valuable insight from them.

> gotporn.com, upornia.com, and mundosexanuncio.com seem like a few obvious sites to check and remove.  Others will be harder to detect if the domain is in an unfamiliar language or an intentionally obscure domain name is used.

This is the list we used: https://github.com/matthewruttley/contentfilter/blob/master/sites.json (noted in https://bugzilla.mozilla.org/show_bug.cgi?id=1211542#c25)

I don't believe that complete (maybe any) elimination of these sites is necessary to provide anonymity to users, but I've made a followup about these in Bug 1263020.
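
For what it's worth, checking the shipped whitelist against that contentfilter list could be scripted along these lines.  This is a sketch only: it assumes sites.json maps category names to lists of domains, and that the whitelist has been dumped from DomainWhitelist.jsm to a flat text file; both assumptions should be verified against the actual files.

    import json

    # Assumed schema: {"adult": ["example1.com", ...], "gambling": [...], ...}
    with open("sites.json") as f:
        categories = json.load(f)

    # Hypothetical flat dump of the domains shipped in DomainWhitelist.jsm.
    with open("whitelist.txt") as f:
        whitelist = {line.strip() for line in f if line.strip()}

    # Report whitelist entries that fall into each contentfilter category.
    for category, domains in sorted(categories.items()):
        flagged = whitelist & set(domains)
        if flagged:
            print(category, sorted(flagged))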

> Can we determine classification of the sites in the list before we start shipping it to understand this possible bias?

In my (not comprehensive) research I haven't found a classification source that seems useful.
Depends on: 1263020
(In reply to chris hofmann from comment #3)
> Trying to understand the reasons behind particular visits to any particular
> domain is a complex problem, which will be magnified across the 2000-3000
> sites we intend to collect data on, many of which will not be in the language
> of the product managers and engineers who will look at the data.

Exactly - domain != reason in a pretty high % of cases, or at best it's speculative.

> re:  We report only on visited domains that are part of a whitelist that is
> shipped with Hello. 
> 
>    why is a whitelist approach chosen?   
>    what kind of bias will this introduce into understanding of how people
> are using hello?
>    what kind of gaps or blind spots will the white list leave in our
> understanding of how users might be using hello?

Does it collect data for site-not-in-whitelist?

>     63% of hello users are outside the US and/or don't speak English.

quite so - a "gold-standard" whitelist would use a top-regional/language-sites approach; admittedly hard or impractical.
Top sites in India may (other than facebook and a small number of others) be entirely off the Alexa list.
We should at least understand how this pattern is related to language or region, and this (if we're doing it) should be exposable in the telemetry data (the data for the correlation may be there, though it may increase the risk of de-anonymization, but that only helps if people are thinking about or expecting this to be an issue).

>     If we detect a problem with any particular whitelist entry for which we
> would rather not collect data, or we find a domain that we want to collect
> data for how is the whitelist updated?

The whitelist collection should be able to be disabled entirely by the user, if only via a config var; preferably something more visible on about:telemetry, perhaps something exposed further (at least the first time Hello is used) the way reporting is on first-profile-run, or in the Hello drop-down's config menu ('gear').  We don't need to do anything to advance a story of our disregarding threats to user privacy, even if you're totally right that it doesn't.  Being right is somewhat cold comfort if news articles and commenters don't realize this, or don't agree.  That is part of why such a list in the source needs internal commentary (or at least a link!) to the analysis that shows it's OK, and why this needs to be explicitly publicized as part of the discussion of adding this feature (blog, whatever).
From Ian's answers in comment 6 it is pretty obvious that the logging and reporting solution that has been proposed is really more a puzzle consisting of three pieces that look like

 selected sites visited by users = understanding of hello usage = plan to streamline popular interactions

than an experiment.  An experiment has a tight thesis of what the outcome might look like; the test is run, data is gathered and analyzed, changes are made, then the test is run again to see if we have changed the outcome.  None of that scientific process seems to apply to this proposal.

The answers show the weak links making the first connections of the puzzle, and even weaker links to the intended result, which is to improve Hello's interactions for a few general use cases.  They also show this is more a brainstorming and exploratory data-gathering exercise than something specific and targeted at learning what we need to know to fix product defects or directly enhance the user experience.

It suggests this data collection, and guessing about what it means, is a better approach than just speaking directly with the existing Hello user base through surveys, studies, and forum discussions to ask them directly what they use Hello for and how it could be streamlined.  I have to respectfully disagree, and I've suggested a variety of ways to get at the end goal much faster and without compromising user privacy or the Mozilla/Firefox privacy brand that we are spending lots of money, time, and effort to create.

Those suggestions are in https://bugzilla.mozilla.org/show_bug.cgi?id=1262614 

If we tried the approach suggested in that bug first and were not satisfied with the results, then other approaches like this data collection might seem appropriate, provided we also inform users and ask their permission to provide more details, similar to what we do with crash reporting dialogs and data collection.

In summary, there are alternatives that we can/should pursue first to learn directly about user behavior, and I'm pretty convinced they are a better fit for the results we seek.  We should take those steps first before risking user privacy and our privacy brand for little marginal value.

Privacy advocates are also upping their game with respect to what we are trying to do here.
They are demanding that software development organizations and governments not only protect the "lock boxes" and encrypted channels that users should expect privacy on, but also the "puzzles" that can infer what users are doing.  https://twitter.com/marciahofmann/status/718161390360342528?s=03 does a good job of explaining why this is important.
RE: > Can we determine classification of the sites in the list before we start shipping it to understand this possible bias?

>> In my (not comprehensive) research I haven't found a classification source that seems useful.

One simple classification method would be to spider all the sites in the list and capture meta tags that the sites advertise.   That might also help to spot porn, medical, or other sensitive categories in the list that you might want to remove.

A simple listing that would at least let you know what's in the data set of sites would look something like this.

aafp.org
    <meta name="keywords" content="aafp, american academy of family physicians"/>
    <meta name="description" content="American Academy of Family Physicians represents more than 115,900 family physicians, family medicine residents, and medical students."/>

txxx.com
    <meta name="description" content="Watch free porn videos online on your desktop or mobile phone. Free full length XXX movies. Share your own sex videos on TubeCup.com" />
    <meta name="keywords" content="tube cup,porn tube, xxx tube, free porn videos, free porn xxx movies, xxx tube video, free xxx vidio clips, xxxtube" />
One other meta tag worth watching for might be:

<meta name="RATING" content="RTA-5042-1996-1400-1577-RTA" />
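
A minimal sketch of such a spider in Python (standard library only; many sites redirect, block unknown clients, or omit these tags entirely, so treat the output as a rough classification aid rather than ground truth):

    import urllib.request
    from html.parser import HTMLParser

    class MetaCollector(HTMLParser):
        # Collects the <meta name="..."> tags of interest from a page.
        def __init__(self):
            super().__init__()
            self.meta = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                a = dict(attrs)
                name = (a.get("name") or "").lower()
                if name in ("keywords", "description", "rating"):
                    self.meta[name] = a.get("content") or ""

    def classify(domain):
        req = urllib.request.Request("http://%s/" % domain,
                                     headers={"User-Agent": "Mozilla/5.0"})
        page = urllib.request.urlopen(req, timeout=10).read()
        parser = MetaCollector()
        parser.feed(page.decode("utf-8", "replace"))
        return parser.meta

    # classify("aafp.org") should surface the keywords/description shown
    # above; a "rating" value containing "RTA" flags adult content.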
Comments on the doc:

"submits a list of domains visited to the Hello server". This should have a more specific description of the URL scheme used. Is there information encoded in the URL about the version of Firefox or Loop or any other metadata? I presume this is HTTPS. Is there enhanced key pinning in effect for this collection domain?

We discussed making sure this data was submitted as an anonymous (no-cookies) HTTP request. That should be done to make sure there is less technical opportunity for tracking, and added to the docs.

I presume that the domains are a prefix match and not an exact match? So sharing on docs.google.com would be picked up in the google.com bucket? It's worth making that explicit.
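
Presumably something like the following suffix walk, to make the question concrete (a hypothetical illustration of the bucketing being asked about, not the actual Loop code):

    def bucket_for(host, whitelist):
        # Walk from the full hostname up through broader suffixes, so
        # docs.google.com lands in the google.com bucket.
        labels = host.lower().rstrip(".").split(".")
        for i in range(len(labels) - 1):
            candidate = ".".join(labels[i:])
            if candidate in whitelist:
                return candidate
        return None  # not whitelisted: dropped, or a generic bucket?

    bucket_for("docs.google.com", {"google.com"})  # -> "google.com"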

The doc should answer the question of what happens for sharing of content not in the whitelist. Is that excluded entirely or submitted in some other generic bucket?

I am uncomfortable with this data collection, for reasons relating to protecting the Firefox brand. Those concerns do not constitute blockers for this being implemented. If there were a clear way to solve the same problem without collecting domains, we *should* choose that. I am relying on Ian and Romain's expert assertion that there is no other way to solve the product problem at hand. This doc, with the changes noted above, is sufficient documentation and is in line with the signoff that Nick Nguyen made about this being acceptable as part of Hello.

What is still a blocker is how we are going to notify users about this and give them control. Firefox Hello has a separate privacy policy which needs to be updated. Marshall Erwin is responsible for certifying that the changes to the privacy policy meet both legal and policy requirements. Because this collection is much more intrusive than the usage metrics we previously collected, we need to expose explicit user control to disable this collection somewhere. I would like to review how users will be able to accomplish that.
Attachment #8737327 - Flags: review?(benjamin)
> re: I presume that the domains are a prefix match and not an exact match? So sharing on docs.google.com
> would be picked up in the google.com bucket? It's worth making that explicit.

Yeah, this would be required to understand where the actual visits are and what users are trying to do, but it also has potential for leaking private user data depending on how the filtering/wildcarding is done on some sites.

This is also required to fix part of the problem with the Alexa list, used as the baseline, not being very good at pinpointing the exact domain location of popular content.

See: https://bugzilla.mozilla.org/show_bug.cgi?id=1263020#c2
>> re:  If there were a clear way to solve the same problem without collecting domains, we *should* choose
>> that. I am relying on Ian and Romain's expert assertion that there is no other way to solve the product
>> problem at hand.

In Ian's response above he outlined a few attempts at small-scale data collection among targeted users, or prospective users, of Hello.  He believed these small-scale, targeted efforts had too much bias, but the proposed study also has bias injected: not collecting porn sites (except in cases where we've made errors in identifying porn content sites from the Alexa list), not tying the sample of top international sites to our international user populations of Firefox and Hello, the bias that predicts it's early adopters that are currently trying to use Hello vs. the target audiences that we are trying to build, and a variety of other biases in the whitelist and the predicted way we will categorize sites.  So the study doesn't solve the bias problem.  It does move from a small amount of data collection to a wider set of data collection, but we haven't heard why a simple survey targeted at a wider audience would not work better toward reaching direct answers to the direct questions that we have about Hello usage.

Ian also outlined the kind of data he was looking to gather and things he intended to learn.  Here is his list and some potentially better ways to get the data under each line item.

1. It's all Mozillians using the tool, doing Mozilla things on Mozilla sites.

   Just do an employee survey.  Make it mandatory or compensated to understand how many Mozillians are using Hello, and ask what they are using it for.

We have user data that tells us the number of Hello users in the world, and this employee survey would give us the exact percentage of Mozillians.

Lots of "mozilla work"  happens off the mozilla.com/org domains so there are chances reading mozillan usage wrong with lots of hits as pointers to newsgroup discussions on http://groups.google.com/ or general news about the technology industry and mozilla's role in that.

The way in which we intend to understand "Mozillian use" is also suspect.  It may be that Mozillians are unlikely to share mozilla.org sites but have found wider sets of sites and use cases that are of interest.  If we ask them directly "do you use Hello for work or non-work activities?" we will get a direct answer.

2. People find it particularly useful for travel, or shopping, or some category along those lines.
3. People find it particularly useful for looking at reviews with each other (technically "shopping", but a different kind of categorization)

Until we categorize all 2000 sites in the proposed whitelist, we won't have a good understanding of the percentage of travel, shopping, or any other categories.

4. It's all enterprisey sites.

Same problem.  Which sites in the whitelist are "enterprisey"?

5. We see that what appears to be a community of users has found the tool, if that community of users is also associated with a whitelisted site.

Since we don't tie users to sites, we won't know if it's a single user, or a pair of users hitting the same site over and over to share, or multiple people in a "community" involved in generally the same use case.  This is another type of bias introduced into the way the data will be reported.

6. We see some local sites develop popularity (many sites are local on the whitelist), and there is a regional interest or word-of-mouth growth happening.  These local sites might also be defined in a way we currently do not understand (since the categories we construct are a byproduct of our cultural lens).

Again, this suffers since we won't know if it's one or a few individuals, or a broader popularity pattern emerging.  The local international site content is admittedly not well thought out or set up in the current whitelist, so we could miss interesting patterns in this kind of usage altogether.

7. There's no patterns at all.

That's another problem.  No matter how scrambled the data, I can *guarantee* we will find patterns ;-)  Whether the patterns are useful in leading to the desired outcome is another question.  Remember, this is the problem we are trying to solve and the progression that needs to take place for this experiment to deliver something useful.

selected sites visited by users => understanding of hello usage => plan to streamline popular interactions

A likely pattern to emerge with a small user base and a newly implemented approach to how Hello works would be for people to just be trying sharing on an example page.  If we had counts of first-run example page hits and total counts of all sites shared by all users, that would be a first step in gathering some non-private data that tells us how far we have moved toward gaining some regular early adopters vs. early-user experimentation churn of just trying things out.  Some data collection targeted at just that would be continually valuable.

With this many missed connections in the data collection and categorization, biases in the data, and flaws in the execution of how the study will be administered and reported, it's hard to see how the progression to the desired outcome will be reached.
re:
> What is still a blocker is how we are going to notify users about this and give them control.
> Because this collection is much more intrusive than the usage metrics we previously collected, 
> we need to expose explicit user control to disable this collection somewhere. 
> I would like to review how users will be able to accomplish that.

There is a place where we already collect site/URL data and send it back to Mozilla servers, and it could serve as a good model and implementation for doing the things Benjamin asks for here: crash data reporting.

That's an opt-in model for each 'event/crash' submission: it asks users for permission and gives them an understanding of what data is about to be sent.  From previous comments it sounds like Ian is averse to this kind of implementation, since it has the potential for introducing bias among the reporters that would be hard to sort out, but I'll let him comment on whether a system like that would work.

One other quick and dirty way to get some site data using existing sources would be to check whether we have had any crashes/crash signatures during Hello sharing activity; users that have opted in would have already sent us detailed URL data, and we have it in storage in Socorro.

The control part is also hard.  Our crash URL recording does not allow for control once the submission has been made.  Maybe it should.
re: 7. There's no patterns at all.

> That's another problem.  No matter how scrambled the data, I can *guarantee* we will find patterns ;-)
> Whether the patterns are useful in leading to the desired outcome is another question.  Remember, this is
> the problem we are trying to solve and the progression that needs to take place for this experiment
> to deliver something useful.

I filed Bug 1264718 - understand user churn v. dedicated and frequent users metrics for hello usage

as a possible better place to start Hello user-pattern data collection.  That's a possible scenario that would show up as no discernible user pattern but actually points out an interesting use case that we need to get more users into, and then advance from.  If we get a bunch of random sites from a large population of first-time experimenters, it will be harder or maybe impossible to flush out the interesting sites of a few more dedicated and frequent users.  The current plan has a bias towards not helping us understand this particular aspect of usage if new-user churn rates are high.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #11)

> What is still a blocker is how we are going to notify users about this and
> give them control. Firefox Hello has a separate privacy policy which needs
> to be updated. Marshall Erwin is responsible for certifying that the changes
> to the privacy policy meet both legal and policy requirements. Because this
> collection is much more intrusive than the usage metrics we previously
> collected, we need to expose explicit user control to disable this
> collection somewhere. I would like to review how users will be able to
> accomplish that.

Should this block the Firefox 46 release? If so, then we need to either update the privacy policy quickly, or uplift another version of the Hello system add-on to back out the domain collection (which it sounds like we aren't planning to implement on 46 release anyway).
Flags: needinfo?(lmandel)
Flags: needinfo?(benjamin)
Benjamin answered here: https://bugzilla.mozilla.org/show_bug.cgi?id=1211542#c34   Thanks!
We should be sure to follow up with the privacy policy if this is planned to ship with the pref on.
Flags: needinfo?(lmandel)
Flags: needinfo?(benjamin)
Support for Hello/Loop has been discontinued.

https://support.mozilla.org/kb/hello-status

Hence closing the old bugs. Thank you for your support.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE