Closed Bug 840043 Opened 11 years ago Closed 10 years ago

Automate the process of providing sanitized Bugzilla dumps to external researchers

Categories

(bugzilla.mozilla.org :: General, defect)

Production
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: mhoye, Assigned: mhoye)

References

(Depends on 1 open bug)

Details

This is a tracking bug to let me keep an eye on how/when/if we can automate the process of providing sanitized Bugzilla dumps to external researchers and research groups.
Assignee: import-export → nobody
Component: Bug Import/Export & Moving → General
Product: Bugzilla → bugzilla.mozilla.org
QA Contact: default-qa
Version: unspecified → Production
https://wiki.mozilla.org/BMO/Meetings/2013-02-11 says:

"System to automatically, periodically produce sanitized db dumps is progressing well. gerv, mhoye, and glob discussed this. Data is dumped and then a week later is sanitized to remove bugs that have since been caught as security/confidential. email is not removed or anonymized since it is already public in BMO and users signing up are warned of this."

Could this not be done better by simply exporting all the data, then deleting all bugs and attachments newer than 1 week? This would avoid the week's delay.

Gerv
Either one works, in my opinion.
(Taking the bug. Duh.)
Assignee: nobody → mhoye
Status: NEW → ASSIGNED
(In reply to Gervase Markham [:gerv] from comment #1)
> Could this not be done better by simply exporting all the data, then
> deleting all bugs and attachments newer than 1 week? This would avoid the
> week's delay.

great idea, that would be simpler :)

We'd also have to remove comments newer than one week, since some of them may be private.
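A minimal sketch of the "export everything, then drop the last week" idea, written against an in-memory SQLite database so it can be run as-is. The table and column names (bugs.creation_ts, attachments.creation_ts, longdescs.bug_when) are a simplified stand-in for the Bugzilla schema, and the function names prune_recent and make_demo_db are hypothetical. The real dump and the existing sanitizing script are not shown in this bug, so treat this as an illustration of the time-window step only, not the actual implementation.

```python
import sqlite3
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # fixed-width, so string comparison orders by time


def prune_recent(conn, now, window_days=7):
    """Delete bugs, attachments and comments newer than now - window_days."""
    cutoff = (now - timedelta(days=window_days)).strftime(TS_FORMAT)
    cur = conn.cursor()
    # Recent comments and attachments go even when the bug itself is old.
    cur.execute("DELETE FROM longdescs WHERE bug_when > ?", (cutoff,))
    cur.execute("DELETE FROM attachments WHERE creation_ts > ?", (cutoff,))
    # Bugs filed inside the window are removed entirely.
    cur.execute("DELETE FROM bugs WHERE creation_ts > ?", (cutoff,))
    conn.commit()
    return cutoff


def make_demo_db():
    """Build a tiny stand-in database (assumed schema, not the real one)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, creation_ts TEXT);
        CREATE TABLE attachments (attach_id INTEGER PRIMARY KEY,
                                  bug_id INTEGER, creation_ts TEXT);
        CREATE TABLE longdescs (comment_id INTEGER PRIMARY KEY,
                                bug_id INTEGER, bug_when TEXT);
        INSERT INTO bugs VALUES (1, '2013-01-01 00:00:00');
        INSERT INTO bugs VALUES (2, '2013-03-12 00:00:00');
        INSERT INTO attachments VALUES (100, 1, '2013-01-05 00:00:00');
        INSERT INTO attachments VALUES (101, 1, '2013-03-11 00:00:00');
        INSERT INTO longdescs VALUES (10, 1, '2013-01-02 00:00:00');
        INSERT INTO longdescs VALUES (11, 1, '2013-03-10 00:00:00');
        INSERT INTO longdescs VALUES (12, 2, '2013-03-12 00:00:00');
    """)
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    prune_recent(conn, now=datetime(2013, 3, 15))
    print(conn.execute("SELECT bug_id FROM bugs").fetchall())
```

Pruning by timestamp only covers content that is too new to have been reviewed. Bugs and comments already flagged as security-sensitive or private would still need the existing sanitizing pass.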
With everyone's consent, I'm going to put a Creative Commons Attribution 3.0 Unported license on this data once it's public.
Hmm. There is not an explicit license under which Bugzilla contributors contribute their content and work. In addition, it may be that some attachments of non-open-source code demonstrating a bug (for example) are attached for the purposes of debugging but the attacher did not intend to say that anyone can use the code for any purpose.

I think it would be better not to put an explicit licence on the data at all, i.e. do what Bugzilla does now.

Gerv
Depends on: 852130
(In reply to Gervase Markham [:gerv] from comment #1)
> email is not removed or anonymized since it is
> already public in BMO and users signing up are warned of this."

This is still very irritating! It's much more difficult to create an account + write a script which scans all bugs to collect reporters, CC members, assignees, QA contacts and voters than to simply download a copy of the bmo DB, run

  SELECT login_name FROM profiles;

and use 100,000+ email addresses very easily for spamming or any other purposes. You give bmo users no chance to prevent that.

Also, sanitizeme.pl contains the following comment:

# This SQL is designed to sanitize a copy of a Bugzilla database so that it
# doesn't contain any information that can't be viewed from a web browser by
# a user who is not logged in.

*who is NOT logged in*! That is not true of this dump: email addresses are visible only to logged-in users, and you seem to be ignoring that.


I don't see why you cannot randomize the domain name of the email address.
Randomizing the domain names in email is likely to dramatically reduce the usefulness of this data for research purposes, which is why we're putting it up in the first place.
(In reply to Mike Hoye [:mhoye] from comment #8)
> Randomizing the domain names in email is likely to dramatically reduce the
> usefulness of this data for research purposes

Why is it useful to know the domain name of an email address? If you are happy to keep @mozilla.org, @mozilla.com, @adobe.com, @microsoft.com, etc. domains public, feel free (though I'm not sure they are happy with your sudden new policy), but I don't see the point of doing the same with e.g. @gmail.com or @hotmail.com domains.
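For illustration, here is a rough sketch of what randomizing the domain part could look like. It is not what the dump process does (that is not described in this bug). The function name randomize_domain, the mapping dictionary, and the keep set of domains left untouched are all hypothetical. Each distinct domain is replaced by the same random label under the reserved .invalid top-level domain, so addresses that shared a domain still share one after the rewrite.

```python
import secrets


def randomize_domain(email, mapping, keep=frozenset()):
    """Replace the domain of `email` with a random, per-domain stable label.

    mapping: dict reused across calls so one domain always maps to one label.
    keep:    domains (lowercase) that are left as they are.
    """
    local, sep, domain = email.rpartition("@")
    if not sep:
        return email  # no "@" present: leave the value untouched
    domain = domain.lower()
    if domain in keep:
        return f"{local}@{domain}"
    if domain not in mapping:
        # .invalid is a reserved TLD, so the result can never receive mail.
        mapping[domain] = secrets.token_hex(8) + ".invalid"
    return f"{local}@{mapping[domain]}"


if __name__ == "__main__":
    seen = {}
    for addr in ["alice@gmail.com", "bob@gmail.com", "carol@mozilla.org"]:
        print(randomize_domain(addr, seen, keep={"mozilla.org"}))
```

Note that the local part is kept, so this hides the mail provider but does not make the address anonymous, and it removes exactly the domain information that a researcher might want.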
You are also lying at https://bugzilla.mozilla.org/createaccount.cgi :

"Bugzilla is a public place, so what you type and your email address will be visible to all logged-in users." <--- note the "logged-in" part of the sentence.

That's inaccurate. Your email address will also be accessible to all spammers downloading a copy of the bmo DB.
Accusations of lying and bad-faith decision-making pretty much guarantee that future discussions on this topic are not going to take a turn for the useful or constructive anytime soon.

We don't individually validate new Bugzilla accounts beyond checking that they can receive an email, so the barrier to scraping Bugzilla through the web or the API is virtually nonexistent. 

As part of https://bugzilla.mozilla.org/show_bug.cgi?id=848969 we will be making a small but trivially reversed change to the list of email addresses provided in these dumps. However, we're doing that to minimize the risk of a particular technical failure, not to defend against spammers.

As it stands, everyone signing up for a BMO account for the last decade or so has been aware that this is a public-facing resource, and that with very few exceptions (HR, security bugs, etc.) any information that participants type into it will be public-facing as well. The number of people who didn't know that going in is either zero or very close to it.

Thank you for bringing the createaccount.cgi omission to our attention; I'll rewrite that wording to make sure our users know about it.
See Also: → 120738
Getting back to this - this is done, and this process is now automated.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
mhoye: that's awesome! For the record, can you give the URL where the dumps live?

Gerv
It seems to still be at http://people.mozilla.org/~mhoye/bugzilla/. Mike, is it hosted somewhere else or is it going to stay in your directory?