Closed Bug 848969 Opened 11 years ago Closed 6 years ago

Make a virtual machine image of BMO available

Categories

(bugzilla.mozilla.org :: General, enhancement)

Production
enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: lizzard, Unassigned)

Details

I would love to have an image of bugzilla.mozilla.org with the latest public data dump, available for community contributors and for researchers, that we could run in VirtualBox or VMware.

It would be helpful to lower the barrier to entry for potential b.m.o. extension developers and people who might contribute patches.
As soon as I can get VirtualBox to not crash my system, I will create a CentOS 6 VM with the BMO code already checked out anonymously and running on an SQLite database. Then I will create a Vagrant box that can be used to get people up and running quickly.

http://www.vagrantup.com/
OS: Mac OS X → All
Hardware: x86 → All
justdave had the really good point that we should sanitize the email data in some way in case someone reconfigures it to send live mail instead of writing it to disk.
(In reply to Liz Henry :lizzard from comment #2)
> justdave had the really good point that we should sanitize the email data in
> some way in case someone reconfigures it to send live mail instead of
> writing it to disk.

This makes me think we should be sanitizing the email addresses as part of the sanitize script as well for all dumps. Maybe using some random hash string in place of the example.com part of the login. This way it can't be reused, and it also protects against the duplicate logins we would get from using only the first part of the email.

dkl
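A minimal sketch of the hashed-domain idea above, not the actual sanitize script. The function name `sanitize_login`, the per-dump `salt`, the 16-character truncation, and the `.invalid` suffix are all assumptions made for illustration. The domain is replaced by a salted hash, so logins that share a local part but differ in domain stay distinct.

```python
import hashlib
import secrets


def sanitize_login(login: str, salt: bytes) -> str:
    """Replace the domain of an email-style login with a salted hash token.

    Hypothetical helper: the token is not a real domain, so mail sent to
    the sanitized login cannot reach the original user.
    """
    local, sep, domain = login.rpartition("@")
    if not sep:
        # Not an email-style login; leave it unchanged.
        return login
    digest = hashlib.sha256(salt + domain.lower().encode("utf-8")).hexdigest()[:16]
    # ".invalid" is a reserved top-level domain that never resolves.
    return f"{local}@{digest}.invalid"


# One random salt per dump (an assumption), so tokens differ between dumps.
salt = secrets.token_bytes(16)

a = sanitize_login("dkl@mozilla.com", salt)
b = sanitize_login("dkl@example.org", salt)
```

With a salt that is thrown away after the dump is produced, the mapping is not meant to be reversed, which is the trade-off debated later in this bug.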
(In reply to David Lawrence [:dkl] from comment #3) 
> This makes me think we should be sanitizing the email addresses as part of
> the sanitize script as well for all dumps. Maybe using some random hash
> string in replacement of the example.com part of the login. This way it
> can't be reused and also it protects against duplicate logins from using on
> the first part of the email.

Or simply set the 'disable_mail' column to true in the profiles table, but mail could still be turned back on per user by the person with the DB dump. The hashed email addresses would still be safer IMO.

dkl
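For illustration, the 'disable_mail' approach could look roughly like the sketch below. It runs against an in-memory SQLite database with a cut-down stand-in for the profiles table. Only the table name and the `disable_mail` column come from this discussion; the other columns, the sample rows, and the helper `disable_all_mail` are assumptions.

```python
import sqlite3

# In-memory stand-in for the dump; the real profiles table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profiles ("
    " userid INTEGER PRIMARY KEY,"
    " login_name TEXT NOT NULL,"
    " disable_mail INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO profiles (login_name, disable_mail) VALUES (?, ?)",
    [("alice@example.com", 0), ("bob@example.com", 0), ("carol@example.com", 1)],
)


def disable_all_mail(connection: sqlite3.Connection) -> int:
    """Set disable_mail for every account; return the number of rows changed."""
    cur = connection.execute(
        "UPDATE profiles SET disable_mail = 1 WHERE disable_mail = 0"
    )
    connection.commit()
    return cur.rowcount


changed = disable_all_mail(conn)
still_enabled = conn.execute(
    "SELECT COUNT(*) FROM profiles WHERE disable_mail = 0"
).fetchone()[0]
```

As noted above, this is only a default: anyone holding the dump can flip the column back per user.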
We had this debate when we were discussing dropping the researchers' agreement policy for the dump we've already made public as well. Consensus was that these email addresses are already public-facing, so don't bother.

Anyone who wants them can scrape them from the site, so obscuring them is basically like adding a DRM layer to the data. A crappy inconvenience for honest researchers, no barrier at all for people who wish to abuse their access.
Perhaps more precisely - I've already got legal, security and privacy's OK to include them as-is.
(In reply to Mike Hoye [:mhoye] from comment #5)
> We had this debate when were discussing dropping the researchers' agreement
> policy for the dump we've already made public as well. Consensus was that
> these email addresses are already public-facing, so don't bother; obscuring
> the
> 
> Anyone who wants them can scrape them from the site, so obscuring them is
> basically like adding DRM layer to the data. A crappy inconvenience for
> honest researchers, no  barrier at all for people who wish to abuse their
> access.

Yeah, I am not against having the email addresses there, but was just thinking of a way to lessen the risk of lots of email being sent to them by mistake. So maybe just setting all users to having email disabled in the profiles table will be good enough for now.

dkl
Can you elaborate on what risk to our users you perceive, there? I don't follow.
Exactly -- I'm not concerned with the privacy aspect of it, or against including the emails. I'm worried about the "ease of accidentally turning on live bugmail in the configuration" issue.   

I can't see someone doing it on purpose, but it might be turned on by accident, which could generate a lot of bugmail for people from test instances.  For example if I were developing an extension that sent mail to people (as I in fact want to do!) inviting them to triage a random bug, I might turn on mail thinking my test bug would email only me. But it would email the giant list of people who get cced on every bug in that component.
(In reply to Mike Hoye [:mhoye] from comment #8)
> Can you elaborate on what risk to our users you perceive, there? I don't
> follow.

Mostly a technical issue, but if for some reason someone imports the DB into a working Bugzilla code checkout, and they accidentally enable email delivery, a lot of users could possibly be sent emails. They could think it was from BMO, but I am sure they would be wondering why they have email from a source they are not aware of. It only takes a one-line change in data/params to enable email delivery.

dkl
How about replacing the @ in the existing email addresses in the sanitized db with some other character or string, so that they aren't valid email addresses?  

That way, it isn't like deleting half the email address, but it would not send out accidental bugmail even if the option were turned on in the profiles table (for testing, by the person who had downloaded the image and db).
That way:
- the data would still be there (which I think is your concern, mhoye) 
- it could be easily restored
- the problem we're worried about would not happen
- Actual emails could still be put into the system by the user, for email testing.
(In reply to Liz Henry :lizzard from comment #12)
> That way:
> - the data would still be there (which I think is your concern, mhoye) 
> - it could be easily restored
> - the problem we're worried about would not happen
> - Actual emails could still be put into the system by the user, for email
> testing.

That works for me. 

dkl@mozilla.com => dkl_at_mozilla_dot_com

dkl
Works for me as well.

Thanks!
We should use a reversible transform, for the sake of the researchers. _ is a valid char in email localparts; so the following email addresses have clashing mappings:

gerv.markham@gerv.net
gerv_dot_markham@gerv.net

Can we simply replace the @ with _at_? That achieves the goal, and is reversible by looking for the first occurrence of "_at_" starting at the right hand end of the string, because it's guaranteed to be present but guaranteed not to be present in the domain name, as _ is not valid in Internet domain names.

Gerv
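Both points in the comment above can be sketched in a few lines. The function names are made up for illustration, and `restore` relies on the stated assumption that "_" never appears in the domain part.

```python
def full_mapping(addr: str) -> str:
    """The comment 13 style mapping: replace both '@' and '.'."""
    return addr.replace("@", "_at_").replace(".", "_dot_")


def obfuscate(addr: str) -> str:
    """Replace only the last '@' with '_at_'."""
    local, sep, domain = addr.rpartition("@")
    if not sep:
        raise ValueError(f"no '@' in {addr!r}")
    return f"{local}_at_{domain}"


def restore(obfuscated: str) -> str:
    """Reverse obfuscate() by splitting on the rightmost '_at_'.

    Assumes the domain contains no '_', so the rightmost '_at_' is
    always the one that obfuscate() inserted.
    """
    local, sep, domain = obfuscated.rpartition("_at_")
    if not sep:
        raise ValueError(f"no '_at_' in {obfuscated!r}")
    return f"{local}@{domain}"


# The two addresses above clash under the full mapping...
clash = full_mapping("gerv.markham@gerv.net") == full_mapping("gerv_dot_markham@gerv.net")

# ...but round-trip cleanly when only the '@' is replaced, even if the
# local part itself contains "_at_".
tricky = "gerv_at_markham@gerv.net"
round_trip = restore(obfuscate(tricky))
```

Searching from the right-hand end is what makes a local part such as "gerv_at_markham" safe: any earlier "_at_" belongs to the local part and is left alone.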
(In reply to Gervase Markham [:gerv] from comment #15)
> We should use a reversible transform, for the sake of the researchers.

if a reversible transform is used, i don't see how that would prevent spammers from applying the same transformation to get the correct addresses.

> Can we simply replace the @ with _at_?

no, that would result in the addresses being invalid according to bugzilla's emailregexp setting.
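To make the emailregexp objection concrete, here is a small check. The pattern is an assumption, written to resemble a typical Bugzilla-style emailregexp; the value actually configured on BMO may differ.

```python
import re

# Assumed pattern, for illustration only; not read from a real data/params.
ASSUMED_EMAILREGEXP = re.compile(r"^[\w.+\-=']+@[\w.\-]+\.[\w\-]+$")


def is_valid_login(login: str) -> bool:
    """Return True if the login matches the assumed emailregexp."""
    return ASSUMED_EMAILREGEXP.match(login) is not None


original_ok = is_valid_login("dkl@mozilla.com")
at_only_ok = is_valid_login("dkl_at_mozilla.com")
full_ok = is_valid_login("dkl_at_mozilla_dot_com")
```

Under this assumed pattern the original address validates and both transformed forms do not, because neither contains an "@".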
(In reply to Byron Jones ‹:glob› from comment #16)
> (In reply to Gervase Markham [:gerv] from comment #15)
> > We should use a reversible transform, for the sake of the researchers.
> 
> if a reversible transform is used, i don't see how that would prevent
> spammers from applying the same transformation to get the correct addresses.

This is not about anti-spam - see comments 5 and 6. It's about making sure people don't accidentally get email from test instances.

> > Can we simply replace the @ with _at_?
> 
> no, that would result in the addresses being invalid according to bugzilla's
> emailregexp setting.

But I am merely suggesting a reduced form of what's suggested and agreed in comment 13 and comment 14. How could that solution be OK but mine not?

Gerv
(In reply to Byron Jones ‹:glob› from comment #16)
> if a reversible transform is used, i don't see how that would prevent
> spammers from applying the same transformation to get the correct addresses.

I guess that is the point. Anything we do to change the email addresses can easily be reversed by the spammer so there is no point in doing it. 

Or if we are worried about spammers, we replace the domain portion with a randomly generated key of some kind.
 
> > Can we simply replace the @ with _at_?
> 
> no, that would result in the addresses being invalid according to bugzilla's
> emailregexp setting.

That can be fixed in the regex in data/params value installed on the VM.

(In reply to Gervase Markham [:gerv] from comment #17)
> This is not about anti-spam - see comments 5 and 6. It's about making sure
> people don't accidentally get email from test instances.
> 

LpSolit would disagree with this point. See bug 840043. If just altering the email addresses to be unusable is not that difficult, why not just do it? Do the researchers really need valid emails for their metrics to come out right?

dkl
(In reply to Gervase Markham [:gerv] from comment #17)
> This is not about anti-spam - see comments 5 and 6. It's about making sure
> people don't accidentally get email from test instances.

oh, sorry gerv.  there's another very similar discussion happening about this regarding anti-spam.

the sanitise script has already been updated in bug 855846 to address that issue -- bugmail is simply disabled for all accounts, and flag-cc's removed.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Apparently I resolved this ~2 years ago with vagrant. :-)
Resolution: INCOMPLETE → FIXED