Closed Bug 932869 Opened 11 years ago Closed 9 years ago

Produce anonymized DB dumps from MDN production

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lorchard, Assigned: scabral)

References

Details

(Whiteboard: [2014q1] March [data: sanitization][data: incident])

For MDN, we want to come up with a solution for bug 665084:

1) Producing anonymized DB dumps that we can offer as downloads - ideally in public for core devs *and* contributors.

2) Updating MDN dev / stage periodically with a copy of scrubbed prod data.

Looks like SUMO had similar things done in bug 850644 and bug 816972

We have some SQL to scrub an MDN database here:

    https://github.com/mozilla/kuma/blob/master/scripts/anonymize.sql

That's going on a year old and will probably need some updates. But, we can revisit that, if the general approach is feasible.

Thanks!
We've kicked off a testday announcement, https://quality.mozilla.org/2013/10/web-qa-mdn-automation-test-day-november-15th/, and are seeing 404s all over dev; should we be using stage, or can we fix dev?
What's your timeframe on this? Clearly before the 15th, but how much time do we actually have to do this? It may not take a long time, but we have to figure out how to prioritize it with what we already have going on.
#1 should be enough for us to do a manual copy to MDN Dev & Stage, so sometime before the 8th should give us enough time to try the manual move before the 15th

#2 is probably a nice-to-have that can wait until after
(In reply to Stephen Donner [:stephend] from comment #1)
> We've kicked off a testday announcement,
> https://quality.mozilla.org/2013/10/web-qa-mdn-automation-test-day-november-
> 15th/, and are seeing 404s all over dev; should we be using stage, or can we
> fix dev?

FWIW, neither dev nor stage for MDN have ever had reliable content on them since we launched Kuma. We've always had lots of 404s, if you follow links from the front page. We probably should have gotten this bug done before announcing a test day, but we've had it on the backburner for over a year :/
(In reply to Luke Crouch [:groovecoder] from comment #3)
> #1 should be enough for us to do a manual copy to MDN Dev & Stage, so
> sometime before the 8th should give us enough time to try the manual move
> before the 15th

Who's the "we" that would do a manual copy? I don't think I have access to DBs on dev or stage to do an import.

> #2 is probably a nice-to-have that can wait until after

While I really, really don't want to see DB dump downloads wait another year, I suspect that setting up a cronjob to move data between databases would be easier than setting up downloads. But, if we can have all the above, it would be really great.
(In reply to Sheeri Cabral [:sheeri] from comment #2)
> What's your timeframe on this? Clearly before the 15th, but how much time do
> we actually have to do this? It may not take a long time, but we have to
> figure out how to prioritize it with what we already have going on.

(In reply to Les Orchard [:lorchard] from comment #5)
> While I really, really don't want to see DB dump downloads wait another
> year, I suspect that setting up a cronjob to move data between databases
> would be easier than setting up downloads. But, if we can have all the
> above, it would be really great.

In other words, I'd really rather we did this *right* than threw something together under pressure. Like I said, it's been waiting over a year for us to get around to it - let's not suddenly turn it into a fire drill now.
I will work on this today - and we'll do it the right way. We've had a busy week so I wanted to get a sense of urgency and timeframe so I could plan out when I was going to get to it. No worries!
What I'm doing, that we can script later on:

on the backup server (backup4):

# import the database
mysql-developer1  -e "create database devmo_sanitize"

gunzip /data/backups/developer1/sqldumps/developer_mozilla_org/developer_mozilla_org.2013.11.05.sql.gz 

mysql-developer1 devmo_sanitize < developer_mozilla_org.2013.11.05.sql

# sanitize
mysql-developer1 devmo_sanitize < anonymize.sql 

# export and compress (3.5G uncompressed, 455M compressed)
mysqldump -S /var/lib/mysql/developer1.sock  -pPASSWORD devmo_sanitize > devmo_sanitize.sql

gzip devmo_sanitize.sql 

# cleanup
mysql-developer1 -e "drop database devmo_sanitize"

# copy to stage
scp /var/tmp/devmo_sanitize.sql.gz stage1.db.scl3.mozilla.com:

# import on stage
gunzip devmo_sanitize.sql.gz 
mysql -e "create database if not exists mdn_sanitized_mozilla_com"
mysql mdn_sanitized_mozilla_com < devmo_sanitize.sql 

You should be all set for a one-off, and let me know if anything about this process should change. If we're good, we can automate the sanitization, copy and import into stage.
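For later automation, the steps above could be wrapped in one script. A rough sketch only: the paths and the `mysql-developer1` wrapper are taken from this comment, the function names are made up, and nothing runs until `sanitize_and_ship` is called.

```shell
#!/bin/sh
# Sketch of scripting the one-off steps above (assumptions noted in the lead-in).
SRC_DB=developer_mozilla_org
WORK_DB=devmo_sanitize
STAGE_HOST=stage1.db.scl3.mozilla.com

# Path of the production backup for a given YYYY.MM.DD stamp.
dump_path() {
  echo "/data/backups/developer1/sqldumps/$SRC_DB/$SRC_DB.$1.sql.gz"
}

# Import, sanitize, export, compress, clean up, and copy to stage.
sanitize_and_ship() {
  stamp=$1
  mysql-developer1 -e "create database $WORK_DB"
  gunzip -c "$(dump_path "$stamp")" | mysql-developer1 "$WORK_DB"
  mysql-developer1 "$WORK_DB" < anonymize.sql
  mysqldump -S /var/lib/mysql/developer1.sock "$WORK_DB" > "$WORK_DB.sql"
  gzip -f "$WORK_DB.sql"
  mysql-developer1 -e "drop database $WORK_DB"
  scp "$WORK_DB.sql.gz" "$STAGE_HOST:"
}
```

The stage-side import (create database if not exists, then load) could be tacked on over ssh once the access question is settled.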
(In reply to Sheeri Cabral [:sheeri] from comment #8)
> You should be all set for a one-off, and let me know if anything about this
> process should change. If we're good, we can automate the sanitization, copy
> and import into stage.

That looks like exactly the process I had in mind. If that works for a one-off on stage, I say go right ahead.

A couple questions going forward with automation:

* Can we script this such that it grabs a fresh copy of anonymize.sql from github (or somewhere handy) before every run? That way, we could self-service updates to it if/when we add new sensitive tables.

* Could we have a copy of the sanitized dump dropped here with each run:

developer1.stage.webapp.scl3.mozilla.com:/data/www/developer.allizom.org/kuma/landfill/

Also, maybe delete any dumps that are over a week old. That would take care of our public download ask, since we also use that directory to serve file upload tarballs:

http://developer.allizom.org/landfill/

Thanks so much for looking at this!
Sadly, our backup servers can't read from the outside world. I wonder how much of a security risk that is.....will ask opssec.
Whiteboard: [2014q1]
+retornam for interest from https://bugzilla.mozilla.org/show_bug.cgi?id=665084#c16

:sheeri - I like the 2014q1 tag on this! ;) Will be great to have it!
(In reply to Sheeri Cabral [:sheeri] from comment #10)
> Sadly, our backup servers can't read from the outside world. I wonder how
> much of a security risk that is.....will ask opssec.

If it helps, we end up with an updated copy of anonymize.sql on our servers with every code push. If the backup servers could read it from there, that would also let us devs self-service the updates. You wouldn't need to pick it up from a 3rd party source

For example:

developer1.webapp.scl3.mozilla.com:/data/www/developer.mozilla.org/kuma/scripts/anonymize.sql
developer1.stage.webapp.scl3.mozilla.com:/data/www/developer.allizom.org/kuma/scripts/anonymize.sql
Started on the script today on backup3, in /data/backups/bin/sanitize.sh. Currently it gets all the options needed and grabs the backup.
As for comment 12 - we don't have any netflows open to there either, but it might be easier to get that open than to pull from github....


/data/backups/bin/devmo_sanitize.sh

/data/backups/bin/sanitize.sh -s /data/backups/bin/devmo_sanitize/devmo_sanitize.sh -i developer1 -o devmo_sanitize -d developer_mozilla_org

I've written a script that takes 4 parameters:
the sanitize script (this one runs anonymize.sql), the name of the source instance ("developer1" in this case), the name of the target instance (devmo_sanitize in this case), and the database name itself (developer_mozilla_org).

It uses the backup from the previous day in order to avoid any problems where the current day's backup is still running.
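Picking the previous day's file is just date arithmetic; a sketch assuming GNU date and the backup filename pattern shown in comment 8:

```shell
# Yesterday's stamp, matching the backup naming (e.g. 2013.11.05),
# so we never read a dump that is still being written.
YESTERDAY=$(date -d yesterday +%Y.%m.%d)
DUMP="/data/backups/developer1/sqldumps/developer_mozilla_org/developer_mozilla_org.$YESTERDAY.sql.gz"
```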

The script is up and working as-is.

Still need to copy it and remove old versions as per comment 9; I need a netflow to do that. And if I'm going to open a netflow to copy to developer1.stage.webapp.scl3.mozilla.com, I can copy the anonymize script from there...once the netflow is opened, that is.

Also need to change the hard coding of the db location, though that doesn't affect this particular instance.

Other than the netflow, this is imminent; just a few small pieces are left to put in place.

Once this is all set up, how frequently do you want sanitized copies copied to stage...daily? weekly?
(In reply to Sheeri Cabral [:sheeri] from comment #15)

> Once this is all set up, how frequently do you want sanitized copies copied
> to stage...daily? weekly?

Just circling back to this... daily would be great, but weekly might be fine & less disruptive.
We can do daily; there's no worry about disruption - backups are taken from a separate backup server that's not in the load balancer/production loop.

I *just* filed the netflow request yesterday (thought I already had - that's my fault), so as soon as it's opened we can have the sanitization and copy running daily.
for my info: the script is run on backup4, where the mdn (developer1) backups currently live (I said backup3 in comment 13). 

The netflow is open, but only from backup3, not backup4. Ugh. New netflow requested. (I also tweaked the script a bit because there were some syntax errors...will be able to fully test soon).
Whiteboard: [2014q1] → [2014q1] February
Whiteboard: [2014q1] February → [2014q1] March
So, I worked on this in the meantime but never updated the bug; timestamps show it was worked on Feb 18-21. Running the script now, but it's already set to run once a month via cron.

Will fix any errors and report on them, and/or will notify if it's good.
Had to fix some permissions, but other than that, the script worked great!

A final run is in progress, and after that I have to write the copy back to the stage server.
Where on developer1.stage.webapp.scl3.mozilla.com would be good to drop the sanitized export?
(In reply to Sheeri Cabral [:sheeri] from comment #21)
> Where on developer1.stage.webapp.scl3.mozilla.com would be good to drop the
> sanitized export?

I think this would be a fine place:

    /data/www/developer.allizom.org/kuma/landfill/

That gets served up here:

    https://developer.allizom.org/landfill/
I ran the script today, and it worked! You can see the file at the URL.

Will leave this open until 4/1, when it's supposed to run automatically via cron.
Status: NEW → ASSIGNED
Whiteboard: [2014q1] March → [2014q1] March (check 4/1)
Hmm, FYI: There was a dump when I checked a few days ago, but it seems to have since been deleted.
Odd, I'll run it manually right now, so the copy should get to the right place. The run takes about 10 mins if I recall correctly.
2014.03.26.sanitized.devmo_sanitize.sql.gz    100%  531MB  75.8MB/s   00:07    

real    5m7.457s
user    5m17.488s
sys     0m30.400s

Should be up there now.
Hmm, dump disappeared again. Maybe the script that produces the attachment & upload dumps is deleting everything? I'm trying to remember who set that up for us initially. Picking two people off the top of my head - sorry to bug you gents, but...
Flags: needinfo?(cturra)
Flags: needinfo?(bburton)
(In reply to Les Orchard [:lorchard] from comment #22)
>
>     /data/www/developer.allizom.org/kuma/landfill/

this directory pulls from a git repo on the admin node during each deployment. so, if the source file(s) do not exist on the admin node, they will be removed during the next deploy. the correct longer term place to store this, would be in /data/developer-stage/src/developer.allizom.org/kuma/landfill/ on developeradm.private.scl3.
Flags: needinfo?(cturra)
Flags: needinfo?(bburton)
(In reply to Chris Turra [:cturra] from comment #28)
> the correct longer term place to store this, would be in
> /data/developer-stage/src/developer.allizom.org/kuma/landfill/ on
> developeradm.private.scl3.

:sheeri, is this something you can update in your scripts?
Flags: needinfo?(scabral)
Absolutely. We might need to wait for a netflow. Do you need an export in the meantime?
Flags: needinfo?(scabral)
(In reply to Sheeri Cabral [:sheeri] from comment #30)
> Absolutely. we might need to wait for a netflow. Do you need an export in
> the meantime?

I *think* we're okay for now, haven't really started depending on this being around yet or widely advertising it. 

FWIW, good news is that the handful of times some of us have downloaded it so far, it's seemed to work great in dev VMs! Thanks!
Glad it's working great for you so far! :D
We're advertising it more widely than we realized, I guess, because we already got a question about it missing from https://developer.allizom.org/landfill/

So do I understand correctly that we lose the file after each deploy? I.e., until we update the script to copy the file to developeradm.private.scl3:/data/developer-stage/src/developer.allizom.org/kuma/landfill/ ?
(In reply to Luke Crouch [:groovecoder] from comment #33)
> We're advertising it more than we know I guess, because we already had a Q
> about it missing from https://developer.allizom.org/landfill/

Yeah, we need to be careful about that. Don't point people at this yet. Until this bug and the one it blocks are closed, this feature is not to be considered "released" or reliable.

> So do I understand correctly that we lose the file after each deploy? I.e.,
> until we update the script to copy the file to
> developeradm.private.scl3:/data/developer-stage/src/developer.allizom.org/
> kuma/landfill/ ?

That's my understanding, yes.
I have committed the fix and am running the cron script manually, since it was set to run monthly on the 1st. If this works I'll change it to weekly on Sundays and we can close this bug after it works over the weekend. (or tweak more).
https://developer.allizom.org/landfill/ now shows devmo_sanitized-latest.sql.gz

Is this acceptable? If so, I've set this to run weekly on Mondays at 12:30 pm (in about 3.5 hours). I can check it later today and if it runs via cron we're all good!
Blocks: 993088
I now see:
	devmo_sanitized-latest.sql.gz	07-Apr-2014 12:35 	559M	 

so it looks like the automated sanitization works; it will run from cron Mondays at 12:30.

Please resolve if this looks good, we have bug 993088 open to monitor it so we'll know when it's not current.
Looks like this file survived the stage push, so this bug looks good to me. Thanks :sheeri!
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Sorry, the .sql.gz file disappeared from the staging server again. :(

https://developer.allizom.org/landfill/

Something must be blowing away this directory periodically?
Status: RESOLVED → REOPENED
Flags: needinfo?(scabral)
Flags: needinfo?(cturra)
Resolution: FIXED → ---
I'm copying to:
/data/developer-stage/src/developer.allizom.org/kuma/landfill/ on developeradm.private.scl3

as recommended by cturra in https://bugzilla.mozilla.org/show_bug.cgi?id=932869#c28

:(
Status: REOPENED → ASSIGNED
Flags: needinfo?(scabral)
Maybe something else does a sync via "rsync -a --del" there?
on our web clusters, /data/<cluster_name>/src/<app_name> is a repo, so no, nothing is deleted out of there by a process on our end. however, the update script might be wiping this directory during deploys.

  https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L19
Flags: needinfo?(cturra)
(In reply to Chris Turra [:cturra] from comment #42)
> on our web clusters, /data/<cluster_name>/src/<app_name> is a repo, so no,
> nothing is deleted out of there by a process on our end. however, the update
> script might be wiping this directory during deploys.
> 
>   https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L19

Can we get this resolved? We have several people with projects that are blocked by the lack of this dump.
Eric, 

I'm not sure if the update script is a dev thing or a webops thing, as neither of those is my purview, but is there anything I can do in the meantime to help the situation? I could put the sanitized backups weekly in google docs and share them with some folks, perhaps?
Yeah, I'm confused where the issue is :/ 

My latest notion goes back to :cturra's comment #42. That is, our update of the code from github wipes out any untracked files - i.e. the DB backup. 

But, there are still files under /landfill/ from the filesystem exports, so it's not like *everything* gets wiped. Just the DB export. So, I'm confused.

Crazy idea: We have some space on Amazon S3 to host public files for contributors. Would it be possible to just push this DB dump out there, given the proper credentials?
(In reply to Les Orchard [:lorchard] from comment #45)
> Crazy idea: We have some space on Amazon S3 to host public files for
> contributors. Would it be possible to just push this DB dump out there,
> given the proper credentials?

i think that'd be overcomplicating this situation. 

on 05/22 i touch[ed] a file in the kuma/landfill/ directory on stage and that file is still present. tho, there hasn't been a push since. could one of you do a push for me to test this out?

 [root@developeradm.private.scl3 developer.allizom.org]# ls -l kuma/landfill/cturra_testfile
 -rw-r--r-- 1 root root 0 May 22 12:59 kuma/landfill/cturra_testfile
I'm running this by hand again, let's see if the file exists and remains there. I'll let you know when it's complete.
Looks like the file is still there:

devmo_sanitized-latest.sql.gz	27-May-2014 14:48 	607M	 

I wonder if the problem was resolved already? The sanitization is scheduled to run monthly on the 1st of the month; maybe it'd be a good idea to run the sanitization more frequently, like once every week on Sunday?
It's not there now. What happened between yesterday and today that might have cleared that file?
The only thing we've done from the dev side is to push code to production. But :cturra says there's nothing in that script that wipes the entire directory?
Flags: needinfo?(cturra)
(In reply to Luke Crouch [:groovecoder] from comment #51)
> The only thing we've done from the dev side is to push code to production.
> But :cturra says there's nothing in that script that wipes the entire
> directory?

that's correct, plus, my test file *is* still there...

 [root@developeradm.private.scl3 developer.allizom.org]# ls -l kuma/landfill/cturra_testfile
 -rw-r--r-- 1 root root 0 May 22 12:59 kuma/landfill/cturra_testfile
Flags: needinfo?(cturra)
woah! look at these cronjobs i just dug up..

# Puppet Name: landfill_archive_uploads
15 4 * * * cd /mnt/netapp/developer.mozilla.org-dekiwiki/; tar --xform="s:^demos:uploads:" -cvzf /mnt/netapp_stage/developer.allizom.org/landfill/uploads-`date +\%F`.tar.gz demos

# Puppet Name: landfill_archive_attachments
0 4 * * * cd /mnt/netapp/developer.mozilla.org/; tar -cvzf /mnt/netapp_stage/developer.allizom.org/landfill/attachments-`date +\%F`.tar.gz attachments

# Puppet Name: landfill_clean_old
0 5 * * * find /mnt/netapp_stage/developer.allizom.org/landfill -name "*.gz" -ctime +0 -exec rm {} \;


:sheeri - can you make that file a .tbz2? that would not be caught by this cleanup cron. this also explains why my test file was not removed!
Flags: needinfo?(scabral)
I can! It's not an archive, just a compressed SQL file, so it's a bz2 file. I've updated the script and it's running now, it should make:

devmo_sanitized-latest.sql.bz2
Flags: needinfo?(scabral)
You might also want to fix or file a bug for these wild scripts, so that they don't bite the next poor guy.
(In reply to Ben Bucksch (:BenB) from comment #55)
> You might also want to fix or file a bug for these wild scripts, so that
> they don't bite the guy poor guy.

these crons have a purpose tho -- to keep the landfill directory cleaned up. removing these would fill the disks in short order.

 [root@developeradm.private.scl3 developer.allizom.org]# ls -lh kuma/landfill/
 total 6.1G
 -rw-r--r-- 1 root root 525M May 30 04:01 attachments-2014-05-30.tar.gz
 -rw-r--r-- 1 root root 366M May 30 07:22 devmo_sanitized-latest.sql.bz2
 -rw-r--r-- 1 root root 5.2G May 30 04:22 uploads-2014-05-30.tar.gz
It's up at https://developer.allizom.org/landfill/devmo_sanitized-latest.sql.bz2 - and it's about half the size of the gzip file :D
> these crons have a purpose tho

No doubt. I just meant to make them more specific to the files that should be cleaned up, and only those.
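For example, the cleanup cron could match only the dated archives instead of every *.gz. A sketch, with filenames taken from the cron listing in comment 53 and a made-up helper name:

```shell
# True only for the dated archive files; the -latest dump is never matched.
should_clean() {
  case $(basename "$1") in
    uploads-*.tar.gz|attachments-*.tar.gz) return 0 ;;
    *) return 1 ;;
  esac
}

# The cron body would then become something like:
# find /mnt/netapp_stage/developer.allizom.org/landfill -type f -ctime +0 \
#   \( -name 'uploads-*.tar.gz' -o -name 'attachments-*.tar.gz' \) -exec rm {} \;
```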

> https://developer.allizom.org/landfill/devmo_sanitized-latest.sql.bz2

Thanks! It's a 365 MB download, that's quite nice.
Hmm, so if this dump keeps from disappearing, then maybe this bug is fixed at last...?
The file has been there for 3 whole days, which I think is a new record! And http://kuma.readthedocs.org/en/latest/installation-vagrant.html#developing-with-vagrant already links to the directory, so I'm calling this fixed! Thanks for your tenacity, everyone!
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
The SQL dump appears not to have been created. 

https://developer.allizom.org/landfill/devmo_sanitized-latest.sql.bz2 [404]
https://developer.allizom.org/landfill/ [Missing <date>.sanitized.devmo_sanitize.sql.gz]
Correct. The sanitization script failed last week, and we have put sanitizations on hold while we look over the code and put in verification to make sure that a sanitization failure will not expose any data.
Reopening, for better error checking.

The DB team got together and added some checks to the bash wrapper scripts to make sure failed runs get caught.

We also came up with some verification ideas, for even more peace of mind. For example, let's say you blank out the "password" field of the "users" table, and run a 256-bit SHA2 on the "login" field of the "user_info" table. In MySQL, SHA2(str, 256) returns a 64-character hexadecimal string containing only letters and numbers.

In this case, the verification queries (each counting rows that violate the expected sanitized form) would be:
SELECT COUNT(*) FROM users WHERE NOT (password <=> '');
SELECT COUNT(*) FROM user_info
WHERE NOT (LENGTH(login) <=> 64)
OR login NOT REGEXP '^[A-Za-z0-9]+$';

What we'd like to do is have the anonymize SQL script output the sum of all the verification queries. The sum should be 0, which we will check in our script. Feel free to exit with an error code and a statement like:

"ERROR: There are $CNT entries for $col that are either not 64 chars long or contain characters other than letters and numbers. This indicates that $col may not be anonymized appropriately."

Is that something you could do with the anonymize.sql script?
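A sketch of what the wrapper-side check might look like, using the hypothetical users/user_info example above (the variable and function names here are made up, not MDN's real schema or scripts):

```shell
# Sum of the verification queries; each subquery counts rows that
# violate the expected sanitized form, so the total must be 0.
VERIFY_SQL="
SELECT
  (SELECT COUNT(*) FROM users WHERE NOT (password <=> ''))
+ (SELECT COUNT(*) FROM user_info
   WHERE NOT (LENGTH(login) <=> 64)
      OR login NOT REGEXP '^[A-Za-z0-9]+\$');"

# The cron wrapper would run:  CNT=$(mysql -N devmo_sanitize -e "$VERIFY_SQL")
check_sanitized() {
  if [ "$1" -ne 0 ]; then
    echo "ERROR: $1 rows appear un-anonymized; refusing to publish the dump" >&2
    return 1
  fi
  return 0
}
```

A failed check would abort the run before anything is copied to stage.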
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Thanks for the proposals! We will look at this after our post-mortem with Privacy and OpSec.
Whiteboard: [2014q1] March (check 4/1) → [2014q1] March [sanitization][incident]
Whiteboard: [2014q1] March [sanitization][incident] → [2014q1] March [data: sanitization][data: incident]
Product: mozilla.org → Data & BI Services Team
Current status is that we want to do sanitizations once a month and share with a limited audience, as per bug 1133709. So I'm resolving this bug.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
"Limited audience" doesn't meet the requirements. The idea was that it's public. The information in MDN is a public good and should thus be published in a form that's useable by any third party, not just selected people.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
MDN's public information is publicly available in a tarball mirror format [1], and via 3rd parties like DevDocs, Dash, and Zeal as listed on our offline maintenance page. [2]

The MDN database contains PII which is not public. As a business and technical security measure around our users' PII, we are no longer producing regular, public database dumps of MDN. [3]


[1] https://bugzilla.mozilla.org/show_bug.cgi?id=757461
[2] https://developer.mozilla.org/media/maintenance/
[3] https://www.mozilla.org/en-US/privacy/
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED