Access to 2019 Kitsune Data
Categories
(support.mozilla.org :: General, enhancement, P1)
Tracking
(Not tracked)
People
(Reporter: Nukeador, Assigned: hmitsch)
Details
I'm kicking off a project to support the SuMo team in shaping a localization strategy moving forward. To do that, I really need to understand the existing community and its contributions, and Kitsune's internal dashboards cannot provide this information.
To get started on this, we need a data dump from Kitsune.
Do you need publicly available data on support.mozilla.org and/or non-exposed / logged-in / admin data?
We need both public and admin data (no need for private messages).
Through this analysis we're aiming to understand what our community is doing:
- What are they doing?
- Who is doing what? (activity levels)
- What are the trends in contribution and contributors? (localization coverage, top articles)
Is this a one-time request, or will you need frequently updated data?
One-time request/dump.
In which format do you expect the data? sql dumps? csvs? Do you want to be able to run Kitsune in a local instance with the said data?
SQL access, they will be using something like Tableau or Jupyter notebooks to do the aggregate analysis.
Are the data going to be handled internally or forwarded to a third party?
We will be asking the work/analysis to be done on our servers by a contractor under NDA.
Will this contractor be the only person with access to the data?
Raw data: contractor + myself, SuMo core team, Henrik, Tasos
Aggregated analysis: more employees, as this will help inform strategy conversations/considerations
Comment 1•7 years ago
:Nukeador you mention both One-time request/dump and SQL access above, are you looking for a single DB dump file that you can load somewhere else or direct access to our AWS RDS MySql instance?
Comment 2•7 years ago (Reporter)
(In reply to Dave Parfitt [:metadave] from comment #1)
:Nukeador you mention both One-time request/dump and SQL access above, are you looking for a single DB dump file that you can load somewhere else or direct access to our AWS RDS MySql instance?
For this request a single DB dump file that we can load somewhere else, thanks!
Comment 3•7 years ago
:jbryner I wanted to loop in InfoSec to see if there are guidelines and/or best-practices that we should be adhering to while sharing out the SUMO database with an external contractor.
Comment 4•7 years ago
(In reply to Dave Parfitt [:metadave] from comment #3)
:jbryner I wanted to loop in InfoSec to see if there are guidelines and/or best-practices that we should be adhering to while sharing out the SUMO database with an external contractor.
Thanks for asking.
As far as I'm aware we still have a moratorium on sharing database dumps stemming from the MDN data leak. ( https://blog.mozilla.org/security/2014/08/01/mdn-database-disclosure/ )
It's hard to tell, though, whether this is a dump of the entire database to a file or standing up an instance that we control for ad hoc queries.
In either case; this should get a pass through legal/privacy (Marshall/Alicia Gray) if it hasn't already for any privacy or legal concerns that may need to be accounted for as well as an approval to share (requirement of the MDN data leak).
If we get a go ahead; best case is that the data is held by Mozilla for querying in an environment we control using our standard SSO/2FA to enable access control, logging, etc.
Comment 5•7 years ago (Assignee)
Hi Rubén,
can you please take care of the items mentioned by Jeff:
- pass through legal/privacy
Once this is done, we can probably do the same thing we did last time:
- Spin up a RDS instance on ParSys AWS
- Restrict access to vendor's IP
Best regards,
Henrik
Comment 6•7 years ago (Assignee)
Hi Dave,
Hey Jeff,
Sorry, this took a while ...
this should get a pass through legal/privacy (Marshall/Alicia Gray) if it hasn't already for any privacy or legal concerns that may need to be accounted for as well as an approval to share
We have an NDA in place with Analyse & Tal. This is the vendor engaged for data analysis. We did similar work in early 2018 and got approval from Legal & Trust. This is documented in https://biztera.com/projects/11465 (Mozilla Support Community and Contributors Data Analysis). I guess this means we are good on this item.
If we get a go ahead; best case is that the data is held by Mozilla for querying in an environment we control using our standard SSO/2FA to enable access control, logging, etc.
We will hold the data on the ParSys AWS account. Queries are done using Tableau. As far as I know, putting the IAM Proxy in front of the SQL endpoint would not work. What we did last year is:
- Use a randomly generated password
- Whitelist the IP from which Analyse & Tal will access the public endpoint
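For illustration only, the two controls listed above could be sketched in Python as follows. This is not the tooling used on the ParSys account; the function names, the password length, and the example address (taken from the 203.0.113.0/24 documentation range) are assumptions made for the sketch.

```python
import ipaddress
import secrets
import string


def generate_password(length=32):
    """Return a random password drawn from letters and digits.

    Uses the `secrets` module, which is intended for security-sensitive
    randomness (unlike `random`).
    """
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def is_allowed(client_ip, allowlist):
    """Return True if client_ip falls inside any CIDR block in allowlist."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowlist)


# Hypothetical allowlist: a single vendor address, written as a /32 block.
ALLOWLIST = ["203.0.113.7/32"]

password = generate_password()
vendor_ok = is_allowed("203.0.113.7", ALLOWLIST)
other_ok = is_allowed("203.0.113.8", ALLOWLIST)
```

In practice the allowlist would more likely live in the database's network rules than in application code; the check here only shows the logic.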
Unless Jeff objects, I suggest we move ahead. I am NI'ing him to see if he wants to chime in.
Best regards,
Henrik
Comment 7•7 years ago
I'll just chime in to say that I'm opposed to giving our users' personally identifiable information (usernames, emails, passwords, private messages, etc.) to any 3rd party. I believe the only way forward here is to take a dump of the current production database, anonymize it, and then either give it to them in that form or load it into an RDS instance specifically for their use. We have an anonymization script in the repo that we can run and check the result. I believe it's been a while since it's been used, but it should be sufficient for this purpose.
Comment 8•7 years ago
We'll use the anonymization script mentioned above to create a DB backup file that can be restored by ParSys in RDS.
Comment 9•7 years ago (Reporter)
Paul, one of the goals I have is to know who is active and where. I'm not interested in personal information, but I'll need at least usernames as part of this analysis.
Comment 10•7 years ago
As you can see in the script, it does keep usernames:
https://github.com/mozilla/kitsune/blob/master/scripts/anonymize.sql#L6-L8
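As a rough sketch of the kind of pass discussed in this thread (keep usernames, overwrite emails and password hashes, drop private messages), here is a small self-contained example. It does not reproduce the contents of scripts/anonymize.sql; the table and column names are hypothetical, and an in-memory SQLite database stands in for the MySQL dump.

```python
import sqlite3


def anonymize(conn):
    """Overwrite emails and password hashes, drop private messages.

    Usernames are left untouched so activity can still be attributed.
    """
    conn.execute(
        "UPDATE users "
        "SET email = 'user' || id || '@example.com', password = ''"
    )
    conn.execute("DELETE FROM private_messages")
    conn.commit()


# Hypothetical schema and sample rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY, username TEXT, email TEXT, password TEXT
    );
    CREATE TABLE private_messages (
        id INTEGER PRIMARY KEY, sender_id INTEGER, body TEXT
    );
    INSERT INTO users VALUES (1, 'alice', 'alice@real.test', 'hash1');
    INSERT INTO users VALUES (2, 'bob', 'bob@real.test', 'hash2');
    INSERT INTO private_messages VALUES (1, 1, 'hello');
    """
)

anonymize(conn)
rows = conn.execute(
    "SELECT id, username, email, password FROM users ORDER BY id"
).fetchall()
message_count = conn.execute(
    "SELECT COUNT(*) FROM private_messages"
).fetchone()[0]
```

Checking the result after running the pass, as suggested earlier in the thread, amounts to querying for any remaining real email addresses or message rows before the backup is shared.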
Comment 11•7 years ago
(In reply to Rubén Martín [:Nukeador] from comment #9)
Paul, one of the goals I have is to know who is active and where. I'm not interested in personal information, but I'll need at least usernames as part of this analysis.
Do you need the username or just a stand-in UUID to uniquely identify the entity?
In general I'd say granting access to a 3rd party means we should treat the data as if it could become accessible beyond the controls we put in place (whitelisted ip, dedicated instance, dedicated account) and we should look for offering the minimum data set that will answer the questions as well as operate within the privacy terms of service we agreed to with folks who have an account.
Do we have an inventory of the exploratory questions and what tables those line up to in the DB?
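To make the stand-in identifier idea concrete, here is a minimal sketch that derives a stable pseudonym from a username with a keyed hash. It is an illustration under assumptions, not something used in this bug: the function name and the key are hypothetical, and the key would have to stay with Mozilla and never be shared alongside the data.

```python
import hashlib
import hmac
import uuid


def pseudonym(username, secret_key):
    """Map a username to a stable UUID-shaped stand-in.

    The same username and key always give the same identifier, so activity
    can still be grouped per contributor. Without the key, the mapping
    cannot be rebuilt by simply hashing a list of known usernames.
    """
    digest = hmac.new(
        secret_key, username.encode("utf-8"), hashlib.sha256
    ).digest()
    # A UUID is 16 bytes, so use the first 16 bytes of the 32-byte digest.
    return uuid.UUID(bytes=digest[:16])


# Hypothetical key for demonstration; a real key would be random and secret.
KEY = b"example-key-not-for-real-use"

alice_id = pseudonym("alice", KEY)
alice_again = pseudonym("alice", KEY)
bob_id = pseudonym("bob", KEY)
```

A keyed hash is chosen here over a plain hash or `uuid.uuid5`, because an unkeyed mapping of public usernames could be reversed by hashing every known username and comparing.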
Comment 12•7 years ago
hi Jeff,
Yes the questions and requests can be seen here:
https://docs.google.com/document/d/1nPmbD1pz7FIOB8smSFUeWZhpG3SFDdu6jnZhaToRR00/edit?ts=5c6fe13c
Comment 13•7 years ago
(In reply to rina from comment #12)
hi Jeff,
Yes the questions and requests can be seen here:
https://docs.google.com/document/d/1nPmbD1pz7FIOB8smSFUeWZhpG3SFDdu6jnZhaToRR00/edit?ts=5c6fe13c
Great, thanks. Dave, is that enough to know what tables/anonymization need to be handled for the export to the dedicated instance?
Comment 14•7 years ago
yes, we're all set. Anonymized DB backup shared with Tasos/Henrik.
Comment 15•7 years ago
also note, email addresses have been anonymized as part of our standard anonymization script.
Comment 16•7 years ago
I can confirm that I have received the backup. I am closing the bug as resolved.
Comment 17•7 years ago (Assignee)
May I suggest keeping the bug open? I will talk to Marshall on Thursday. I am taking the bug and will report back after my meeting.
Best regards,
Henrik
Comment 18•7 years ago (Assignee)
Closing the bug. Marshall reinforced that it's important to have anonymized data and that the data we share qualifies as "classification: public", which it does. So we are good to go.
-Henrik