Looking for a DB dump from the Moztrap server

RESOLVED FIXED

Status

--
minor
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: mhoye, Assigned: scabral)

Tracking

Details

(Whiteboard: [2014q1] February [data: export])

(Reporter)

Description

5 years ago
I'd like to be able to provide researchers with a DB dump of the contents of MozTrap, provided I can do so in a way that's sane and sanitized to protect user data.
Sheeri: what's the policy on this type of thing?  I know we'd have to wipe the email addresses, of course.  But perhaps that's easy enough to do?
(Assignee)

Comment 2

5 years ago
Yep, we'd have to sanitize the e-mails. That's not hard to do. We'd have to figure out the optimal way to do it - if e-mail isn't a primary/unique key, we can just set it to the empty string (''). If it is, we can do something like set the email to be something like 'userid@invalidemail'.
(Assignee)

Comment 3

5 years ago
It looks like the only e-mail field is in the auth_user table, is that correct?

If that's the case, there is no index on e-mail. We can sanitize that table, and for good measure, we can set both the email and passwords field to be blank.
(Assignee)

Comment 4

5 years ago
Steps I'm taking:

#### get the backup of the moztrap database
cd /data/backups/generic/sqldumps/moztrap_mozilla_org
gunzip moztrap_mozilla_org.2013.10.10.sql.gz 

result:
-rwx------ 1 root root 333M Oct 10 12:22 moztrap_mozilla_org.2013.10.10.sql

#### import it into a database
mysql-generic -e "create database moztrap_sanitize"
mysql-generic moztrap_sanitize < moztrap_mozilla_org.2013.10.10.sql 

(this took about 5 minutes)

#### sanitize
mysql-generic -vv -e "UPDATE moztrap_sanitize.auth_user SET email='', password='' WHERE 1=1"

Showed 1057 rows changed.

#### export and clean up
mysqldump -p -S /var/lib/mysql/generic.sock  moztrap_sanitize | gzip -c > moztrap_sanitize_2013_10_11.sql.gz

(this took less than a minute)

mysql-generic -e "drop database if exists moztrap_sanitize"

Then just copy the file to where you need it.

This is likely something we can automate. Please import the file I shared with you on Google Drive and make sure there's no other private/confidential information in there.
(Reporter)

Comment 5

5 years ago
So, intuitively, I think there's research value in being able to group datasets by user, even if we can't identify that user directly; may I ask, can we replace user email addresses with something deterministically obscured?

Say: md5( %bang_on_the_keyboard_for_a_moment + "email_address") @ invalidemail 

?

Thanks,

- mhoye
(Assignee)

Comment 6

5 years ago
That we can do for sure. We could update with an md5sum of the concatenated hash of email + password, that should be random enough, right? (and then erase the password)

If the rest of the data checks out OK (please do make sure there's nothing else private there), I can redo the steps to reflect this new version, and then send the final result.
(Assignee)

Comment 7

5 years ago
Have you checked out the rest of the data yet? I can redo the export with an md5sum  of the concatenated hash of email + password once the rest of the data is verified.

Comment 8

5 years ago
Hi, is there any news on the data set? -- Bram
(Assignee)

Comment 9

5 years ago
I was waiting for the verification of the rest of the data set before doing an md5sum of the concatenated hash of the e-mail plus password.
(Assignee)

Comment 10

5 years ago
#### import
mysql-generic -e "create database moztrap_sanitize"

mysql-generic moztrap_sanitize < /data-2/backup2.db.phx1.mozilla.com/backups/generic/sqlcopies/moztrap_mozilla_org/moztrap_mozilla_org.2013.11.05.sql 

#### sanitize
mysql-generic -vv -e "UPDATE moztrap_sanitize.auth_user SET email=md5(CONCAT(email,password)), password='' WHERE 1=1"

-- Rows matched: 1118  Changed: 1118  Warnings: 0


#### export
mysqldump -p -S /var/lib/mysql/generic.sock  moztrap_sanitize | gzip -c > moztrap_sanitize_2013_11_06.sql.gz


#### clean up
mysql-generic -e "drop database if exists moztrap_sanitize"

I have shared the result - a file called moztrap_sanitize_2013_11_06.sql.gz - in Google Drive in the "data share with mhoye" folder....Let me know if that works for you?
Assignee: nobody → scabral
Flags: needinfo?(mhoye)
(Assignee)

Comment 11

5 years ago
If the data looks OK to you, I'll proceed making this a regular export, but with sha256 on the password, as it's more secure as per the review.
(Reporter)

Comment 12

5 years ago
Looks fine to me. I think we could just drop the password part here completely if we were so inclined, and replace it with email=sha256(email+some_earlier_mashing_on_the_keyboard) if we were so inclined.
Flags: needinfo?(mhoye)
(Assignee)

Updated

5 years ago
Whiteboard: [2014q1]
(Assignee)

Updated

5 years ago
Whiteboard: [2014q1] → [2014q1] February
(Assignee)

Comment 14

5 years ago
Promised for next week (week ending 20 Feb)
(Assignee)

Comment 15

5 years ago
An update - I played with this script a bunch on Tuesday, and came back to it today. We are very close, I've been running it in an automated way each time, so the testing is taking longer but it's getting everything.

I do believe this will be complete tomorrow, but of course stuff happens.
(Assignee)

Comment 16

5 years ago
removing the old instance wasn't working, I was using ls | xargs. Changed to find -exec and trying again...
(Assignee)

Comment 17

5 years ago
I am exceedingly close, I have one last bit of code to fiddle with (getting mysqldump to work how I want it to). 

Where did you want this export copied to, and with what frequency? (e.g. bugzilla sanitizations are done quarterly and uploaded to Google Drive).
(Reporter)

Comment 18

5 years ago
Quarterly would be great, if that's quick and easy.
(Assignee)

Comment 19

5 years ago
Script is complete and automated! Scheduled to run on the 1st of every month so all we have to do is upload the result to google drive quarterly.

I have placed a sanitized file at 2014.02.20.sanitized.moztrap.sql.gz

Please feel free to check it.

The sanitization does:
UPDATE auth_user SET email=sha2(CONCAT(email,password),512), password='' WHERE 1=1;
(Assignee)

Updated

5 years ago
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
(Assignee)

Comment 20

4 years ago
For our info, the export ends up in /data/moztrap_sanitize/ on the backup server for moztrap (generic), which currently is backup2 in phx.
(Assignee)

Comment 21

4 years ago
The June sanitized backup (from the backup 5/31) is at https://docs.google.com/a/mozilla.com/file/d/0B_GP1OghOtJSaVhUMHdMbUlJcGc/
(Assignee)

Comment 22

4 years ago
July sanitized backup (from the 6/30 backup) is at https://drive.google.com/a/mozilla.com/file/d/0B_GP1OghOtJScUI1T3hJUFRfZFU/edit?usp=sharing - it's in the folder where I share data with mhoye. Let me know if you need access.
(Assignee)

Comment 23

4 years ago
August sanitized backup (from the 7/31 backup) is at https://docs.google.com/a/mozilla.com/file/d/0B_GP1OghOtJSZmplNkFRMkg4ZHM/edit

Comment 24

4 years ago
Great, would it be possible to provide access to the August backup in Google Drive or put it in Dropbox? Thanks in advance!
(Assignee)

Updated

4 years ago
Whiteboard: [2014q1] February → [2014q1] February [export]
(Assignee)

Updated

4 years ago
Whiteboard: [2014q1] February [export] → [2014q1] February [data: export]
You need to log in before you can comment on or make changes to this bug.