Closed Bug 699530 Opened 13 years ago Closed 11 years ago

Match prod db with Mozillians.org dev and stage

Categories

(Participation Infrastructure :: Phonebook, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED
2013-11-07

People

(Reporter: aakashd, Assigned: sancus)

References

Details

(Whiteboard: [kb=816687][qa-])

We found an issue with groups due to a lack of information on our dev and stage db's. It'd be good to run a cron to match the prod db with dev and stage on a regular basis (daily or weekly is fine).
Will this data be sanitized? I'd recommend against doing this as syncing prod jeopardize user's data.
We would have to sanitize the data somehow, yes. We do it on Input with a script; I assume we could create something to do something similar here.
Should we move this over to mozilla.org/server operations?
Component: mozillians.org → Phonebook
Product: Websites → Community Tools
QA Contact: mozillians-org → phonebook
Version: unspecified → other
I think we need to do a quick technical analysis on this. 

Assumptions:
* that we do want a periodic or automated refresh of dev/staging database schemas from prod, plus enough production-like data to exercise the system with
* that the prod data include some private fields like location & email address, which were provided by users under the assumption that they'd be managed in a particular way
* that some volunteers will be participating in coding and qa on dev and staging environments at some point

Questions:
* How is Mozillians doing it now?
** how is automation done?
** how is authentication/authorization done?
** are data transformed in the copy?
* How are other Mozilla projects doing it?
** how is automation done?
** how is authentication/authorization done?
** are data transformed in the copy?
* How would Mozillians like to do it in the future?

:sancus said he'd take on the technical research and then hand it off to me for any necessary diligence.
Assignee: nobody → sancus
Target Milestone: --- → 2013-02-21
Target Milestone: 2013-02-21 → ---
We should run a cron to copy the prod db to staging & dev periodically(once per week? per 24 hours? Copies should be near-instant because the DB is pretty tiny). Staging has an anonymized db, but it does not appear to be updating, and dev has an ancient db that is not anonymized at all!

We need to avoid anonymizing admin+staff accounts because the logins are dependent on browserid, and changing the emails breaks it. 

I was thinking we could have a script that does the copying and anonymizing every, say, I dunno, week or so(or maybe even 24 hours, mozillians is a pretty tiny database) and has a query to check for accounts that have admin+staff flags, and NOT anonymize them.
That all sounds definitely do-able. Are we still waiting for :sancus to do the technical research?
(e.g. on what's anonymized and how)
Right now just email is anonymized, by changing the email username to a number with a simple counter. I think we should anonymize fullname as well.

For fullname, we could just use numbers, or we could do something fancy like use a simple name generator in python... numbers are probably sufficient, though. :p

I'll update this bug with the specific database fields necessary to determine admin+staff access as well.
Assignee: sancus → nobody
Whiteboard: [2012q2]April
This got stalled at the end of Feb, any desire to revive this? We can probably knock it out pretty quickly once we get moving on it.
Sheeri, we would like to do this. We have a bit of a sprint up until the end of the quarter. Can we ping you Jul. 1 or so?
Yes, that would work. We can work on this in q3.
Whiteboard: [2012q2]April → [2012q3]
Is this still wanted for q3?
Whiteboard: [2012q3] → [2013q3]
Flags: needinfo?(hoosteeno)
Sheeri, :sancus will work on this with you this quarter. Thanks for the reminder!
Flags: needinfo?(hoosteeno)
Whiteboard: [2013q3] → [2013q3][kb=816687]
Assignee: nobody → giorgos
Status: NEW → ASSIGNED
Assignee: giorgos → nobody
Status: ASSIGNED → NEW
Taking about the q3 goal, as there's no possible way this can happen before the end of the quarter. We can help work on it, but with 1 week left it will not be completed.
Whiteboard: [2013q3][kb=816687] → [kb=816687]
Thanks :sheeri. We will bring it up with you again when someone can work on it.
Suggested implementation:

* Schedule periodically: once a month?
* Use django-anonymizer
* Copy data from production, including administrators and API keys
* Exclude administrator accounts from anonymization
* Anonymize everything on dev, including images
* Anonymize email addresses on staging
sounds about right, do you have mechanisms in place to anonymize images?
django-anonymizer lets you write your own anonymizers. So we could pull from a small pool of them created one time from random search results or whatever.
Whiteboard: [kb=816687] → [kb=816687] [2013q4] November
So it sounds like devs would produce the code for sanitization, is that right? (this is what we do with bugzilla, the sanitization is provided by the bugzilla devs, and we restore a backup, run the sanitization, export and deliver the final product).
Yes that is right, we will provide sanitization code.

Thanks
Assignee: nobody → sancus
Status: NEW → ASSIGNED
Whiteboard: [kb=816687] [2013q4] November → [kb=816687]
The script we'll be using for this is PR'd here: https://github.com/mozilla/mozillians/pull/695/files

There is (rudimentary) documentation, but I'm mainly making a note so everyone on this bug is aware and can comment on the planned fields for anonymization.

https://github.com/Sancus/mozillians/blob/88a97dfb871705f1c474073862c78872f078f111/scripts/mysql-anonymize/anonymize_dev.yml are the fields for dev.

nullify means set to NULL, everything else is pretty much the way it sounds like. Random emails are ####@mozilla.com.
Just a note: from what I understand, the next step is in the DBA team's hands, to actually use this code to sanitize. Is that correct?
Commits pushed to master at https://github.com/mozilla/mozillians

https://github.com/mozilla/mozillians/commit/cbcc44bfeb0d8f7a5c6dddfe2c05f03abfebd120
[Bug 699530] Add base anonymize.py

https://github.com/mozilla/mozillians/commit/15b11a0512aecd1a5819f792007170009c348132
[Fix Bug 699530] Add anonymize configs for dev & stage, and fix up anonymize.py

https://github.com/mozilla/mozillians/commit/8b42fc27d12a4a84f3ac85a5d81aec79c92086a6
Merge pull request #695 from Sancus/mysql-anonymize

[Fix Bug 699530] add mysql anonymization script and configuration files
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Not technically FIXED yet. This should go live tomorrow, at which point this code will be on dev, stage, and prod, and the DBA team can use it.

There is one caveat - There's at least one table(south_migrationhistory) that we don't want copied between the servers because that describes the migration state of the database, and there may be others.

I was wondering if maybe the DBA team would want to keep the script that actually performs the copying in our repo, and then we can share responsibility for keeping it updated for environment stuff(you guys) and appropriate table exports/imports(us), or if there's some other preferred way to handle this coordination?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [kb=816687] → [kb=816687][qa-]
The scripts will likely include cron jobs that span multiple hosts (e.g. production backup host for anonymizing and export, dev + stage for importing) so I'm not sure how it will look.

Then again, it seems silly to have a cron in one repo, that calls a script that's kept in another repo. We could submit to the github repo, that's probably easiest.
Target Milestone: --- → 2013-11-07
:sheeri, I think you're right that the next step is for the DBA team to run this (and, as in comment 25, to set up a scheduling job and save it in a repo). Do you have any idea when we might expect it?

Thanks!
Flags: needinfo?(scabral)
Yeah, this should be reassigned to someone from the DBA side when there's time for it.
Is "by the end of November" acceptable?
Flags: needinfo?(scabral)
Considering that it took us just over 2 years to get the bug ready for DBAs, I guess 2 more weeks is OK... :)

Feel free to reassign if it helps track things.
creating the script now. Looks like we can wget, right now the script wgets just the anonymize.py script.

Should we also grab the _dev and _stage yml files? right now they're not configured the way we need them to be - the db on the backup server is called "mozillians_org". If that's ours to play with in github, let us know, otherwise we'll have this config file separate.
Changing the database names in the .xml files in github is no problem, feel free.
So I have the script getting a new copy of the anonymize.py script, taking the latest backup and importing it into sanitize_mozillians_org, and then running the anonymize.py script. However, it's not actually sanitizing the data....can we get a debug version where it prints out the statements it's running if we call it with -d?

Also, is there a file with connection information we're supposed to set somewhere? I can't clearly see where the authentication information to MySQL is stored, and so maybe that's why the script isn't working for me?
I guess I never linked the documentation directly in this bug(it's in the earlier patches), but it's here: http://mozillians.readthedocs.org/en/latest/mysql-anonymize.html

To make it short, the script doesn't directly connect to mysql. Printing the statements is actually how it works! You need to tell it where the .yml file is, like: `python anonymize.py anonymize_dev.yml` and it should print out all the SQL, which you can redirect to a text file or pipe directly into mysql or whatever you like.
Hah! I did see that, and I ended up outputting to a file....because that's what the example was. I'll continue to work on this tomorrow.
Verified the script works as intended. I have the script to anonymize running now, getting stage and dev exports, and will work on the copy part.
Awesome, thanks Sheeri! This is exciting. Our testing will be so much better on stage once this is complete.
FWIW, to import, sanitize and export mozillians_org twice (once for stage, once for dev) it takes 30 seconds.

Onto the copying!
copying complete, tested and pull request submitted.

As per a conversation today, there are some tasks/celery tables that should be dropped as part of the sanitization, the "username" field should be changed to "identifier" in the last stanza of the dev yml file, and the south_migrationhistory should be dropped, so I'm awaiting those happening in the script.
Commits pushed to master at https://github.com/mozilla/mozillians

https://github.com/mozilla/mozillians/commit/1adbf7f9348fa57d36e732fb81c6643e3a81be46
[Bug 699530] Sanitizer now drops tables that we don't want to copy to dev/stage

https://github.com/mozilla/mozillians/commit/4063c7c76eae5c718d46e37d25d75206f50def36
Merge pull request #733 from Sancus/drop-tables

[Bug 699530] Sanitizer now drops tables that we don't want to copy to dev/stage
07:45 < sancus> what happens when we change the db on dev and prod hasn't been updated yet
07:45 < sancus> that will make this start failing won't it
07:45 < sancus> or worse, breaking things
07:45 < sheeri> yeah, the latter
07:45 < sheeri> the export/import will still work.
07:46 < sancus> that's an issue
07:46 < sancus> Maybe we don't actually want this to be automatic on an interval
07:46 < sheeri> sancus: what if we dropped daily or weekly exports in a place you can reach and gave you the power 
                to import?
07:46 < sheeri> *nod*
07:46 < sancus> sheeri: That would be great
stage has been refreshed.
importing dev now.

One additional consideration for stage is copying the images over.
In case reversion is needed, backups are in /root/archive on dev1.db in phx.

To actually do the import:
on dev1.db in phx:
for dev:
mysql mozillians_dev_allizom_org < /data/backup-drop/generic/mozillians_org/mozillians_org.2013.11.27.sanitized_dev.sql

for stage:
mysql mozillians_stage < /data/backup-drop/generic/mozillians_org/mozillians_org.2013.11.27.sanitized_stage.sql

Images:
on genericadm:
rm -f /mnt/netapp_stage/mozillians.org/media/uploads/userprofile/*

and then:
cp /mnt/netapp/mozillians.org/media/uploads/userprofile/* /mnt/netapp_stage/mozillians.org/media/uploads/userprofile/
08:18 < sancus> sheeri: I suggest you just leave it on the actual servers and we file a separate bug to figure that 
                out, and until then we'll just ask you or a webops person to run the import script when needed

Next step: put mozillians-anonymize.sh into cron via puppet on backup2. Will have this run sanitization & copy weekly.
Blocks: 943923
This is done. Dev/stage dbs now contain their appropriately sanitized copies of prod data.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Thanks so much, sheeri!
added this to the backup cron for the generic server backups:

0 20 * * 3 root /data/backups/bin/mozillians-anonymize.sh 

In revision 78777.
Commits pushed to master at https://github.com/mozilla/mozillians

https://github.com/mozilla/mozillians/commit/b963c2e375628a55391b2591e3d6fada3ebee6b1
[bug 699530] Remove taskboard_* tables.

taskboard_* tables are leftovers of now removed code. We removed them
from our database, so now we can remove them from this script as well.

https://github.com/mozilla/mozillians/commit/8562a9f9d79b4a0fff5c239d780c74e63ff912b8
Merge pull request #736 from glogiotatidis/anon

[bug 699530] Remove taskboard_* tables.
Thank you :sheeri and :giorgos -- bumping to verified -- we have sanitized data on dev and stage
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.