Closed Bug 932869 Opened 11 years ago Closed 9 years ago

Produce anonymized DB dumps from MDN production

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lorchard, Assigned: scabral)

References

Details

(Whiteboard: [2014q1] March [data: sanitization][data: incident])

For MDN, we want to come up with a solution for bug 665084:

1) Producing anonymized DB dumps that we can offer as downloads - ideally in public for core devs *and* contributors.

2) Updating MDN dev / stage periodically with a copy of scrubbed prod data.

Looks like SUMO had similar things done in bug 850644 and bug 816972

We have some SQL to scrub an MDN database here:

    https://github.com/mozilla/kuma/blob/master/scripts/anonymize.sql

That's going on a year old and will probably need some updates. But, we can revisit that, if the general approach is feasible.

Thanks!
We've kicked off a testday announcement, https://quality.mozilla.org/2013/10/web-qa-mdn-automation-test-day-november-15th/, and are seeing 404s all over dev; should we be using stage, or can we fix dev?
What's your timeframe on this? Clearly before the 15th, but how much time do we actually have to do this? It may not take a long time, but we have to figure out how to prioritize it with what we already have going on.
#1 should be enough for us to do a manual copy to MDN Dev & Stage, so sometime before the 8th should give us enough time to try the manual move before the 15th

#2 is probably a nice-to-have that can wait until after
(In reply to Stephen Donner [:stephend] from comment #1)
> We've kicked off a testday announcement,
> https://quality.mozilla.org/2013/10/web-qa-mdn-automation-test-day-november-
> 15th/, and are seeing 404s all over dev; should we be using stage, or can we
> fix dev?

FWIW, neither dev nor stage for MDN have ever had reliable content on them since we launched Kuma. We've always had lots of 404s, if you follow links from the front page. We probably should have gotten this bug done before announcing a test day, but we've had it on the backburner for over a year :/
(In reply to Luke Crouch [:groovecoder] from comment #3)
> #1 should be enough for us to do a manual copy to MDN Dev & Stage, so
> sometime before the 8th should give us enough time to try the manual move
> before the 15th

Who's the "we" that would do a manual copy? I don't think I have access to DBs on dev or stage to do an import.

> #2 is probably a nice-to-have that can wait until after

While I really, really don't want to see DB dump downloads wait another year, I suspect that setting up a cronjob to move data between databases would be easier than setting up downloads. But, if we can have all the above, it would be really great.
(In reply to Sheeri Cabral [:sheeri] from comment #2)
> What's your timeframe on this? Clearly before the 15th, but how much time do
> we actually have to do this? It may not take a long time, but we have to
> figure out how to prioritize it with what we already have going on.

(In reply to Les Orchard [:lorchard] from comment #5)
> While I really, really don't want to see DB dump downloads wait another
> year, I suspect that setting up a cronjob to move data between databases
> would be easier than setting up downloads. But, if we can have all the
> above, it would be really great.

In other words, I'd really rather we did this *right* than threw something together under pressure. Like I said, it's been waiting over a year for us to get around to it - let's not suddenly turn it into a fire drill now.
I will work on this today - and we'll do it the right way. We've had a busy week so I wanted to get a sense of urgency and timeframe so I could plan out when I was going to get to it. No worries!
What I'm doing, that we can script later on:

on the backup server (backup4):

# import the database
mysql-developer1  -e "create database devmo_sanitize"

gunzip /data/backups/developer1/sqldumps/developer_mozilla_org/developer_mozilla_org.2013.11.05.sql.gz 

mysql-developer1 devmo_sanitize < developer_mozilla_org.2013.11.05.sql

# sanitize
mysql-developer1 devmo_sanitize < anonymize.sql 

# export and compress (3.5G uncompressed, 455M compressed)
mysqldump -S /var/lib/mysql/developer1.sock  -pPASSWORD devmo_sanitize > devmo_sanitize.sql

gzip devmo_sanitize.sql 

# cleanup
mysql-developer1 -e "drop database devmo_sanitize"

# copy to stage
scp /var/tmp/devmo_sanitize.sql.gz stage1.db.scl3.mozilla.com:

# import on stage
gunzip devmo_sanitize.sql.gz 
mysql -e "create database if not exists mdn_sanitized_mozilla_com"
mysql mdn_sanitized_mozilla_com < devmo_sanitize.sql 

You should be all set for a one-off, and let me know if anything about this process should change. If we're good, we can automate the sanitization, copy and import into stage.
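For later automation, the steps above could be wrapped in one script. A rough sketch only: the paths and the `mysql-developer1` wrapper are taken from this comment, the function names are made up, and nothing runs until `sanitize_and_ship` is called.

```shell
#!/bin/sh
# Sketch of scripting the one-off steps above (assumptions noted in the lead-in).
SRC_DB=developer_mozilla_org
WORK_DB=devmo_sanitize
STAGE_HOST=stage1.db.scl3.mozilla.com

# Path of the production backup for a given YYYY.MM.DD stamp.
dump_path() {
  echo "/data/backups/developer1/sqldumps/$SRC_DB/$SRC_DB.$1.sql.gz"
}

# Import, sanitize, export, compress, clean up, and copy to stage.
sanitize_and_ship() {
  stamp=$1
  mysql-developer1 -e "create database $WORK_DB"
  gunzip -c "$(dump_path "$stamp")" | mysql-developer1 "$WORK_DB"
  mysql-developer1 "$WORK_DB" < anonymize.sql
  mysqldump -S /var/lib/mysql/developer1.sock "$WORK_DB" > "$WORK_DB.sql"
  gzip -f "$WORK_DB.sql"
  mysql-developer1 -e "drop database $WORK_DB"
  scp "$WORK_DB.sql.gz" "$STAGE_HOST:"
}
```

The stage-side import (create database if not exists, then load) could be tacked on over ssh once the access question is settled.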
(In reply to Sheeri Cabral [:sheeri] from comment #8)
> You should be all set for a one-off, and let me know if anything about this
> process should change. If we're good, we can automate the sanitization, copy
> and import into stage.

That looks like exactly the process I had in mind. If that works for a one-off on stage, I say go right ahead.

A couple questions going forward with automation:

* Can we script this such that it grabs a fresh copy of anonymize.sql from github (or somewhere handy) before every run? That way, we could self-service updates to it if/when we add new sensitive tables.

* Could we have a copy of the sanitized dump dropped here with each run:

developer1.stage.webapp.scl3.mozilla.com:/data/www/developer.allizom.org/kuma/landfill/

Also, maybe delete any dumps that are over a week old. That would take care of our public download ask, since we also use that directory to serve file upload tarballs:

http://developer.allizom.org/landfill/

Thanks so much for looking at this!
Sadly, our backup servers can't read from the outside world. I wonder how much of a security risk that is.....will ask opssec.
Whiteboard: [2014q1]
+retornam for interest from https://bugzilla.mozilla.org/show_bug.cgi?id=665084#c16

:sheeri - I like the 2014q1 tag on this! ;) Will be great to have it!
(In reply to Sheeri Cabral [:sheeri] from comment #10)
> Sadly, our backup servers can't read from the outside world. I wonder how
> much of a security risk that is.....will ask opssec.

If it helps, we end up with an updated copy of anonymize.sql on our servers with every code push. If the backup servers could read it from there, that would also let us devs self-service the updates. You wouldn't need to pick it up from a 3rd party source

For example:

developer1.webapp.scl3.mozilla.com:/data/www/developer.mozilla.org/kuma/scripts/anonymize.sql
developer1.stage.webapp.scl3.mozilla.com:/data/www/developer.allizom.org/kuma/scripts/anonymize.sql
Started on the script today on backup3, in /data/backups/bin/sanitize.sh. Currently it gets all the options needed and grabs the backup.
As for comment 12 - we don't have any netflows open to there either, but it might be easier to get that open than to pull from github....


/data/backups/bin/devmo_sanitize.sh

/data/backups/bin/sanitize.sh -s /data/backups/bin/devmo_sanitize/devmo_sanitize.sh -i developer1 -o devmo_sanitize -d developer_mozilla_org

I've written a script that takes 4 parameters:
the sanitize script (this one runs anonymize.sql), the name of the source instance ("developer1" in this case), the name of the target instance (devmo_sanitize in this case), and the database name itself (developer_mozilla_org).

It uses the backup from the previous day in order to avoid any problems where the current day's backup is still running.
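Picking the previous day's file is just date arithmetic; a sketch assuming GNU date and the backup filename pattern shown in comment 8:

```shell
# Yesterday's stamp, matching the backup naming (e.g. 2013.11.05),
# so we never read a dump that is still being written.
YESTERDAY=$(date -d yesterday +%Y.%m.%d)
DUMP="/data/backups/developer1/sqldumps/developer_mozilla_org/developer_mozilla_org.$YESTERDAY.sql.gz"
```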

The script is up and working as-is.

Still need to copy it and remove old versions as per comment 9; I need a netflow to do that. And if I'm going to open a netflow to copy to developer1.stage.webapp.scl3.mozilla.com, I can copy the anonymize script from there...once the netflow is opened, that is.

Also need to change the hard coding of the db location, though that doesn't affect this particular instance.

Other than the netflow, this is imminent; just a few small pieces are left to put in place.

Once this is all set up, how frequently do you want sanitized copies copied to stage...daily? weekly?
(In reply to Sheeri Cabral [:sheeri] from comment #15)

> Once this is all set up, how frequently do you want sanitized copies copied
> to stage...daily? weekly?

Just circling back to this... daily would be great, but weekly might be fine & less disruptive.
We can do daily; there's no worry about disruption - backups are taken from a separate backup server that's not in the load balancer/production loop.

I *just* filed the netflow request yesterday (thought I already had - that's my fault), so as soon as it's opened we can have the sanitization and copy running daily.
for my info: the script is run on backup4, where the mdn (developer1) backups currently live (I said backup3 in comment 13). 

The netflow is open, but only from backup3, not backup4. Ugh. New netflow requested. (I also tweaked the script a bit because there were some syntax errors...will be able to fully test soon).
Whiteboard: [2014q1] → [2014q1] February
Whiteboard: [2014q1] February → [2014q1] March
So, I worked on this in the meantime but never updated the bug; timestamps show it was worked on Feb 18-21. Running the script now, but it's already set to run once a month via cron.

Will fix any errors and report on them, and/or will notify if it's good.
Had to fix some permissions, but other than that, the script worked great!

A final run is in progress, and after that I have to write the copy back to the stage server.
Where on developer1.stage.webapp.scl3.mozilla.com would be good to drop the sanitized export?
(In reply to Sheeri Cabral [:sheeri] from comment #21)
> Where on developer1.stage.webapp.scl3.mozilla.com would be good to drop the
> sanitized export?

I think this would be a fine place:

    /data/www/developer.allizom.org/kuma/landfill/

That gets served up here:

    https://developer.allizom.org/landfill/
I ran the script today, and it worked! You can see the file at the URL.

Will leave this open until 4/1, when it's supposed to run automatically via cron.
Status: NEW → ASSIGNED
Whiteboard: [2014q1] March → [2014q1] March (check 4/1)
Hmm, FYI: There was a dump when I checked a few days ago, but it seems to have since been deleted.
Odd, I'll run it manually right now, so the copy should get to the right place. The run takes about 10 mins if I recall correctly.
2014.03.26.sanitized.devmo_sanitize.sql.gz    100%  531MB  75.8MB/s   00:07    

real    5m7.457s
user    5m17.488s
sys     0m30.400s

Should be up there now.
Hmm, dump disappeared again. Maybe the script that produces the attachment & upload dumps is deleting everything? I'm trying to remember who set that up for us initially. Picking two people off the top of my head - sorry to bug you gents, but...
Flags: needinfo?(cturra)
Flags: needinfo?(bburton)
(In reply to Les Orchard [:lorchard] from comment #22)
>
>     /data/www/developer.allizom.org/kuma/landfill/

this directory pulls from a git repo on the admin node during each deployment. so, if the source file(s) do not exist on the admin node, they will be removed during the next deploy. the correct longer term place to store this, would be in /data/developer-stage/src/developer.allizom.org/kuma/landfill/ on developeradm.private.scl3.
Flags: needinfo?(cturra)
Flags: needinfo?(bburton)
(In reply to Chris Turra [:cturra] from comment #28)
> the correct longer term place to store this, would be in
> /data/developer-stage/src/developer.allizom.org/kuma/landfill/ on
> developeradm.private.scl3.

:sheeri, is this something you can update in your scripts?
Flags: needinfo?(scabral)
Absolutely. We might need to wait for a netflow. Do you need an export in the meantime?
Flags: needinfo?(scabral)
(In reply to Sheeri Cabral [:sheeri] from comment #30)
> Absolutely. we might need to wait for a netflow. Do you need an export in
> the meantime?

I *think* we're okay for now, haven't really started depending on this being around yet or widely advertising it. 

FWIW, good news is that the handful of times some of us have downloaded it so far, it's seemed to work great in dev VMs! Thanks!
Glad it's working great for you so far! :D
We're advertising it more widely than we realized, I guess, because we already got a question about it missing from https://developer.allizom.org/landfill/

So do I understand correctly that we lose the file after each deploy? I.e., until we update the script to copy the file to developeradm.private.scl3:/data/developer-stage/src/developer.allizom.org/kuma/landfill/ ?
(In reply to Luke Crouch [:groovecoder] from comment #33)
> We're advertising it more than we know I guess, because we already had a Q
> about it missing from https://developer.allizom.org/landfill/

Yeah, we need to be careful about that. Don't point people at this yet. Until this bug and the one it blocks are closed, this feature is not to be considered "released" or reliable.

> So do I understand correctly that we lose the file after each deploy? I.e.,
> until we update the script to copy the file to
> developeradm.private.scl3:/data/developer-stage/src/developer.allizom.org/
> kuma/landfill/ ?

That's my understanding, yes.
I have committed the fix and am running the cron script manually, since it was set to run monthly on the 1st. If this works I'll change it to weekly on Sundays and we can close this bug after it works over the weekend. (or tweak more).
https://developer.allizom.org/landfill/ now shows devmo_sanitized-latest.sql.gz

Is this acceptable? If so, I've set this to run weekly on Mondays at 12:30 pm (in about 3.5 hours). I can check it later today and if it runs via cron we're all good!
Blocks: 993088
I now see:
	devmo_sanitized-latest.sql.gz	07-Apr-2014 12:35 	559M	 

so it looks like the automated sanitization works; it will run from cron Mondays at 12:30.

Please resolve if this looks good, we have bug 993088 open to monitor it so we'll know when it's not current.
Looks like this file survived the stage push, so this bug looks good to me. Thanks :sheeri!
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Sorry, the .sql.gz file disappeared from the staging server again. :(

https://developer.allizom.org/landfill/

Something must be blowing away this directory periodically?
Status: RESOLVED → REOPENED
Flags: needinfo?(scabral)
Flags: needinfo?(cturra)
Resolution: FIXED → ---
I'm copying to:
/data/developer-stage/src/developer.allizom.org/kuma/landfill/ on developeradm.private.scl3

as recommended by cturra in https://bugzilla.mozilla.org/show_bug.cgi?id=932869#c28

:(
Status: REOPENED → ASSIGNED
Flags: needinfo?(scabral)
Maybe something else does a sync via "rsync -a --del" there?
on our web clusters, /data/<cluster_name>/src/<app_name> is a repo, so no, nothing is deleted out of there by a process on our end. however, the update script might be wiping this directory during deploys.

  https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L19
Flags: needinfo?(cturra)
(In reply to Chris Turra [:cturra] from comment #42)
> on our web clusters, /data/<cluster_name>/src/<app_name> is a repo, so no,
> nothing is deleted out of there by a process on our end. however, the update
> script might be wiping this directory during deploys.
> 
>   https://github.com/mozilla/kuma/blob/master/scripts/chief_deploy.py#L19

Can we get this resolved? We have several people with projects that are blocked by the lack of this dump.
Eric, 

I'm not sure if the update script is a dev thing or a webops thing, as neither of those is my purview, but is there anything I can do in the meantime to help the situation? I could put the sanitized backups weekly in google docs and share them with some folks, perhaps?
Yeah, I'm confused where the issue is :/ 

My latest notion goes back to :cturra's comment #42. That is, our update of the code from github wipes out any untracked files - i.e. the DB backup. 

But, there are still files under /landfill/ from the filesystem exports, so it's not like *everything* gets wiped. Just the DB export. So, I'm confused.

Crazy idea: We have some space on Amazon S3 to host public files for contributors. Would it be possible to just push this DB dump out there, given the proper credentials?
(In reply to Les Orchard [:lorchard] from comment #45)
> Crazy idea: We have some space on Amazon S3 to host public files for
> contributors. Would it be possible to just push this DB dump out there,
> given the proper credentials?

i think that'd be overcomplicating this situation. 

on 05/22 i touch[ed] a file in the kuma/landfill/ directory on stage and that file is still present. tho, there hasn't been a push since. could one of you do a push for me to test this out?

 [root@developeradm.private.scl3 developer.allizom.org]# ls -l kuma/landfill/cturra_testfile
 -rw-r--r-- 1 root root 0 May 22 12:59 kuma/landfill/cturra_testfile
I'm running this by hand again, let's see if the file exists and remains there. I'll let you know when it's complete.
Looks like the file is still there:

devmo_sanitized-latest.sql.gz	27-May-2014 14:48 	607M	 

I wonder if the problem was resolved already? The sanitization is scheduled to run monthly on the 1st of the month; maybe it'd be a good idea to run the sanitization more frequently, like once every week on Sunday?
It's not there now. What happened between yesterday and today that might have cleared that file?
The only thing we've done from the dev side is to push code to production. But :cturra says there's nothing in that script that wipes the entire directory?
Flags: needinfo?(cturra)
(In reply to Luke Crouch [:groovecoder] from comment #51)
> The only thing we've done from the dev side is to push code to production.
> But :cturra says there's nothing in that script that wipes the entire
> directory?

that's correct, plus, my test file *is* still there...

 [root@developeradm.private.scl3 developer.allizom.org]# ls -l kuma/landfill/cturra_testfile
 -rw-r--r-- 1 root root 0 May 22 12:59 kuma/landfill/cturra_testfile
Flags: needinfo?(cturra)
woah! look at these cronjobs i just dug up..

# Puppet Name: landfill_archive_uploads
15 4 * * * cd /mnt/netapp/developer.mozilla.org-dekiwiki/; tar --xform="s:^demos:uploads:" -cvzf /mnt/netapp_stage/developer.allizom.org/landfill/uploads-`date +\%F`.tar.gz demos

# Puppet Name: landfill_archive_attachments
0 4 * * * cd /mnt/netapp/developer.mozilla.org/; tar -cvzf /mnt/netapp_stage/developer.allizom.org/landfill/attachments-`date +\%F`.tar.gz attachments

# Puppet Name: landfill_clean_old
0 5 * * * find /mnt/netapp_stage/developer.allizom.org/landfill -name "*.gz" -ctime +0 -exec rm {} \;


:sheeri - can you make that file a .tbz2? that would not be caught by this cleanup cron. this also explains why my test file was not removed!
Flags: needinfo?(scabral)
I can! It's not an archive, just a compressed SQL file, so it's a bz2 file. I've updated the script and it's running now, it should make:

devmo_sanitized-latest.sql.bz2
Flags: needinfo?(scabral)
You might also want to fix or file a bug for these wild scripts, so that they don't bite the next poor guy.
(In reply to Ben Bucksch (:BenB) from comment #55)
> You might also want to fix or file a bug for these wild scripts, so that
> they don't bite the guy poor guy.

these crons have a purpose tho -- to keep the landfill directory cleaned up. removing these would fill the disks in short order.

 [root@developeradm.private.scl3 developer.allizom.org]# ls -lh kuma/landfill/
 total 6.1G
 -rw-r--r-- 1 root root 525M May 30 04:01 attachments-2014-05-30.tar.gz
 -rw-r--r-- 1 root root 366M May 30 07:22 devmo_sanitized-latest.sql.bz2
 -rw-r--r-- 1 root root 5.2G May 30 04:22 uploads-2014-05-30.tar.gz
It's up at https://developer.allizom.org/landfill/devmo_sanitized-latest.sql.bz2 - and it's about half the size of the gzip file :D
> these crons have a purpose tho

No doubt. I just meant to make them more specific to the files that should be cleaned up, and only those.
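For example, the cleanup cron could match only the dated archives instead of every *.gz. A sketch, with filenames taken from the cron listing in comment 53 and a made-up helper name:

```shell
# True only for the dated archive files; the -latest dump is never matched.
should_clean() {
  case $(basename "$1") in
    uploads-*.tar.gz|attachments-*.tar.gz) return 0 ;;
    *) return 1 ;;
  esac
}

# The cron body would then become something like:
# find /mnt/netapp_stage/developer.allizom.org/landfill -type f -ctime +0 \
#   \( -name 'uploads-*.tar.gz' -o -name 'attachments-*.tar.gz' \) -exec rm {} \;
```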

> https://developer.allizom.org/landfill/devmo_sanitized-latest.sql.bz2

Thanks! It's a 365 MB download, that's quite nice.
Hmm, so if this dump keeps from disappearing, then maybe this bug is fixed at last...?
The file has been there for 3 whole days, which I think is a new record! And http://kuma.readthedocs.org/en/latest/installation-vagrant.html#developing-with-vagrant already links to the directory, so I'm calling this fixed! Thanks for your tenacity, everyone!
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
The SQL dump appears not to have been created. 

https://developer.allizom.org/landfill/devmo_sanitized-latest.sql.bz2 [404]
https://developer.allizom.org/landfill/ [Missing <date>.sanitized.devmo_sanitize.sql.gz]
Correct. The sanitization script failed last week, and we have put sanitizations on hold while we look over the code and put in verification to make sure that a sanitization failure will not expose any data.
Reopening, for better error checking.

The DB team got together and added some checks to the bash wrapper scripts to make sure failed runs get caught.

We also came up with some verification ideas, for even more peace of mind. For example, let's say you blank out the "password" field of the "users" table, and run a 256-bit SHA2 on the "login" field of the "user_info" table. In MySQL, SHA2(str, 256) returns a 64-character hexadecimal string containing only letters and numbers.

In this case, the verification queries (each counting rows that violate the expected sanitized form) would be:
SELECT COUNT(*) FROM users WHERE NOT (password <=> '');
SELECT COUNT(*) FROM user_info
WHERE NOT (LENGTH(login) <=> 64)
OR login NOT REGEXP '^[A-Za-z0-9]+$';

What we'd like to do is have the anonymize SQL script output the sum of all the verification queries. The sum should be 0, which we will check in our script. Feel free to exit with an error code and a statement like:

"ERROR: There are $CNT entries for $col that are either not 64 chars long or contain characters other than letters and numbers. This indicates that $col may not be anonymized appropriately."

Is that something you could do with the anonymize.sql script?
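A sketch of what the wrapper-side check might look like, using the hypothetical users/user_info example above (the variable and function names here are made up, not MDN's real schema or scripts):

```shell
# Sum of the verification queries; each subquery counts rows that
# violate the expected sanitized form, so the total must be 0.
VERIFY_SQL="
SELECT
  (SELECT COUNT(*) FROM users WHERE NOT (password <=> ''))
+ (SELECT COUNT(*) FROM user_info
   WHERE NOT (LENGTH(login) <=> 64)
      OR login NOT REGEXP '^[A-Za-z0-9]+\$');"

# The cron wrapper would run:  CNT=$(mysql -N devmo_sanitize -e "$VERIFY_SQL")
check_sanitized() {
  if [ "$1" -ne 0 ]; then
    echo "ERROR: $1 rows appear un-anonymized; refusing to publish the dump" >&2
    return 1
  fi
  return 0
}
```

A failed check would abort the run before anything is copied to stage.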
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Thanks for the proposals! We will look at this after our post-mortem with Privacy and OpSec.
Whiteboard: [2014q1] March (check 4/1) → [2014q1] March [sanitization][incident]
Whiteboard: [2014q1] March [sanitization][incident] → [2014q1] March [data: sanitization][data: incident]
Product: mozilla.org → Data & BI Services Team
Current status is that we want to do sanitizations once a month and share with a limited audience, as per bug 1133709. So I'm resolving this bug.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
"Limited audience" doesn't meet the requirements. The idea was that it's public. The information in MDN is a public good and should thus be published in a form that's useable by any third party, not just selected people.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
MDN's public information is publicly available in a tarball mirror format [1], and via 3rd parties like DevDocs, Dash, and Zeal as listed on our offline maintenance page. [2]

The MDN database contains PII which is not public. As a business and technical security measure around our users' PII, we are no longer producing regular, public database dumps of MDN. [3]


[1] https://bugzilla.mozilla.org/show_bug.cgi?id=757461
[2] https://developer.mozilla.org/media/maintenance/
[3] https://www.mozilla.org/en-US/privacy/
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED