I'd like to be able to provide researchers with a DB dump of the contents of MozTrap, provided I can do so in a way that's sane and sanitized to protect user data.
Sheeri: what's the policy on this type of thing? I know we'd have to wipe the email addresses, of course. But perhaps that's easy enough to do?
Yep, we'd have to sanitize the e-mails. That's not hard to do. We'd have to figure out the optimal way to do it - if e-mail isn't a primary/unique key, we can just set it to the empty string (''). If it is, we can do something like set the email to be something like 'userid@invalidemail'.
It looks like the only e-mail field is in the auth_user table, is that correct? If that's the case, there is no index on e-mail. We can sanitize that table, and for good measure, we can set both the email and passwords field to be blank.
Steps I'm taking: #### get the backup of the moztrap database cd /data/backups/generic/sqldumps/moztrap_mozilla_org gunzip moztrap_mozilla_org.2013.10.10.sql.gz result: -rwx------ 1 root root 333M Oct 10 12:22 moztrap_mozilla_org.2013.10.10.sql #### import it into a database mysql-generic -e "create database moztrap_sanitize" mysql-generic moztrap_sanitize < moztrap_mozilla_org.2013.10.10.sql (this took about 5 minutes) #### sanitize mysql-generic -vv -e "UPDATE moztrap_sanitize.auth_user SET email='', password='' WHERE 1=1" Showed 1057 rows changed. #### export and clean up mysqldump -p -S /var/lib/mysql/generic.sock moztrap_sanitize | gzip -c > moztrap_sanitize_2013_10_11.sql.gz (this took less than a minute) mysql-generic -e "drop database if exists moztrap_sanitize" Then just copy the file to where you need it. This is likely something we can automate. Please import the file I shared with you on Google Drive and make sure there's no other private/confidential information in there.
So, intuitively, I think there's research value in being able to group datasets by user, even if we can't identify that user directly; may I ask, can we replace user email addresses with something deterministically obscured? Say: md5( %bang_on_the_keyboard_for_a_moment + "email_address") @ invalidemail ? Thanks, - mhoye
That we can do for sure. We could update with an md5sum of the concatenated hash of email + password, that should be random enough, right? (and then erase the password) If the rest of the data checks out OK (please do make sure there's nothing else private there), I can redo the steps to reflect this new version, and then send the final result.
Have you checked out the rest of the data yet? I can redo the export with an md5sum of the concatenated hash of email + password once the rest of the data is verified.
Hi, is there any news on the data set? -- Bram
I was waiting for the verification of the rest of the data set before doing an md5sum of the concatenated hash of the e-mail plus password.
#### import mysql-generic -e "create database moztrap_sanitize" mysql-generic moztrap_sanitize < /data-2/backup2.db.phx1.mozilla.com/backups/generic/sqlcopies/moztrap_mozilla_org/moztrap_mozilla_org.2013.11.05.sql #### sanitize mysql-generic -vv -e "UPDATE moztrap_sanitize.auth_user SET email=md5(CONCAT(email,password)), password='' WHERE 1=1" -- Rows matched: 1118 Changed: 1118 Warnings: 0 #### export mysqldump -p -S /var/lib/mysql/generic.sock moztrap_sanitize | gzip -c > moztrap_sanitize_2013_11_06.sql.gz #### clean up mysql-generic -e "drop database if exists moztrap_sanitize" I have shared the result - a file called moztrap_sanitize_2013_11_06.sql.gz - in Google Drive in the "data share with mhoye" folder....Let me know if that works for you?
Assignee: nobody → scabral
If the data looks OK to you, I'll proceed making this a regular export, but with sha256 on the password, as it's more secure as per the review.
Looks fine to me. I think we could just drop the password part here completely if we were so inclined, and replace it with email=sha256(email+some_earlier_mashing_on_the_keyboard) if we were so inclined.
Promised for next week (week ending 20 Feb)
An update - I played with this script a bunch on Tuesday, and came back to it today. We are very close, I've been running it in an automated way each time, so the testing is taking longer but it's getting everything. I do believe this will be complete tomorrow, but of course stuff happens.
removing the old instance wasn't working, I was using ls | xargs. Changed to find -exec and trying again...
I am exceedingly close, I have one last bit of code to fiddle with (getting mysqldump to work how I want it to). Where did you want this export copied to, and with what frequency? (e.g. bugzilla sanitizations are done quarterly and uploaded to Google Drive).
Quarterly would be great, if that's quick and easy.
Script is complete and automated! Scheduled to run on the 1st of every month so all we have to do is upload the result to google drive quarterly. I have placed a sanitized file at 2014.02.20.sanitized.moztrap.sql.gz Please feel free to check it. The sanitization does: UPDATE auth_user SET email=sha2(CONCAT(email,password),512), password='' WHERE 1=1;
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
For our info, the export ends up in /data/moztrap_sanitize/ on the backup server for moztrap (generic), which currently is backup2 in phx.
The June sanitized backup (from the backup 5/31) is at https://docs.google.com/a/mozilla.com/file/d/0B_GP1OghOtJSaVhUMHdMbUlJcGc/
July sanitized backup (from the 6/30 backup) is at https://drive.google.com/a/mozilla.com/file/d/0B_GP1OghOtJScUI1T3hJUFRfZFU/edit?usp=sharing - it's in the folder where I share data with mhoye. Let me know if you need access.
August sanitized backup (from the 7/31 backup) is at https://docs.google.com/a/mozilla.com/file/d/0B_GP1OghOtJSZmplNkFRMkg4ZHM/edit
Great, would it be possible to provide access to the August backup in Google Drive or put it in Dropbox? Thanks in advance!
Whiteboard: [2014q1] February → [2014q1] February [export]
Whiteboard: [2014q1] February [export] → [2014q1] February [data: export]
You need to log in before you can comment on or make changes to this bug.