Create nightly dumps of Buildhub's Postgres and Elasticsearch databases
Categories
(Cloud Services :: Operations: Miscellaneous, task)
Tracking
(Not tracked)
People
(Reporter: bhearsum, Unassigned)
Details
I'm trying to get Buildhub's local development story in better shape. One of the things that would make it much easier is having up-to-date copies of the production data. As far as I know there are no secrets in either database, so if we could dump them on a nightly basis and make them available via a public, static URL it would be incredibly helpful.
If either database is huge (let's say, > 1GB when heavily compressed) we could do selective dumps instead of full dumps.
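(For illustration only: a minimal sketch of what such a nightly job could look like, assuming a plain pg_dump piped through gzip and uploaded with whatever CLI the hosting platform provides. The connection string, bucket, and file names below are placeholders, not the actual Buildhub2 configuration.)

```python
#!/usr/bin/env python3
"""Hypothetical nightly dump job: pg_dump -> gzip -> public bucket.

All names below (connection string, bucket, file names) are placeholders,
not the real Buildhub2 setup.
"""
import datetime
import subprocess

DB_URL = "postgresql://buildhub2@db.example.internal/buildhub2"  # placeholder
BUCKET = "gs://buildhub2-public-dumps"                           # placeholder


def dump_and_upload():
    stamp = datetime.date.today().isoformat()
    dump_file = f"buildhub2-{stamp}.sql.gz"

    # pg_dump in plain-text format, piped through gzip so memory use stays
    # flat regardless of database size.
    with open(dump_file, "wb") as out:
        pg = subprocess.Popen(["pg_dump", "--no-owner", DB_URL],
                              stdout=subprocess.PIPE)
        subprocess.run(["gzip", "-9"], stdin=pg.stdout, stdout=out, check=True)
        pg.stdout.close()
        if pg.wait() != 0:
            raise RuntimeError("pg_dump failed")

    # Upload the dated dump, then refresh a stable "latest" object so
    # consumers can rely on a fixed public URL.
    subprocess.run(["gsutil", "cp", dump_file, f"{BUCKET}/{dump_file}"],
                   check=True)
    subprocess.run(["gsutil", "cp", f"{BUCKET}/{dump_file}",
                    f"{BUCKET}/latest.sql.gz"], check=True)


if __name__ == "__main__":
    dump_and_upload()
```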
Comment 1•6 years ago
This would be awesome. Incidentally, if we did this, it would be trivially easy to set up an Airflow job to import the dataset into BigQuery, which would be very useful for all sorts of data-sciencey things.
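(For illustration, a hedged sketch of such an Airflow job, assuming the nightly export also produces newline-delimited JSON in a GCS bucket, since BigQuery cannot load a raw SQL dump directly. Every project, bucket, and table ID below is a placeholder; the import path matches Airflow 1.10-era conventions.)

```python
# Hypothetical Airflow DAG: load a nightly newline-delimited JSON export
# of the build records into BigQuery. All IDs below are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path
from google.cloud import bigquery


def load_builds_into_bq(**context):
    client = bigquery.Client(project="my-telemetry-project")  # placeholder
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",  # replace the table each night
    )
    load_job = client.load_table_from_uri(
        "gs://buildhub2-public-dumps/latest.ndjson",      # placeholder
        "my-telemetry-project.buildhub2.builds",          # placeholder
        job_config=job_config,
    )
    load_job.result()  # block until the load job finishes


dag = DAG(
    dag_id="buildhub2_to_bigquery",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
)

PythonOperator(task_id="load_builds", python_callable=load_builds_into_bq,
               provide_context=True, dag=dag)
```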
Comment 2•6 years ago
Hi :ulfr,
Could you comment from a security ops perspective: would it be OK if we export and publish buildhub2's database on a daily basis?
Hi Ben,
A few questions and comments from the ops perspective:
- Which databases are you referring to? IIRC, there is only one database on the db instance, i.e. "buildhub2". The current size of the database is about 2.5GB; I suspect the gzip'd dump file would be smaller than that. (Please clarify which database you want the dump of.)
- I don't quite understand why the db dump is needed, to be honest. All the data in buildhub2 comes from http://archive.mozilla.org/. Does buildhub2 have any data that http://archive.mozilla.org/ doesn't have? If we publish buildhub2's database, isn't it essentially just a second http://archive.mozilla.org/, only with a different UI?
- Like you mentioned, the buildhub2 database currently has no secrets in it. What I fear is that sometime in the future a secret is added to it, people forget that there is a db dump job running, and the secret gets leaked accidentally. That's one of the reasons I usually do not recommend dumping and publishing prod databases.
- If other services need the data from buildhub2, would it be possible for them to call a buildhub2 API to download it, rather than relying on its db dumps? It feels more future-proof, more secure, and more modern, IMHO.
That said, if this is truly needed by the dev team, and we can get buy-off from all stakeholders, I guess we can proceed with the next steps.
Thanks.
Reporter
Comment 3•6 years ago
(In reply to :wezhou from comment #2)
> - Which databases are you referring to? IIRC, there is only one database on the db instance, i.e. "buildhub2". The current size of the database is about 2.5GB; I suspect the gzip'd dump file would be smaller than that. (Please clarify which database you want the dump of.)
I was hoping we could dump both the Postgres and the Elasticsearch databases. The Elasticsearch one is a bit less important, but still nice to have.
> - I don't quite understand why the db dump is needed, to be honest. All the data in buildhub2 comes from http://archive.mozilla.org/. Does buildhub2 have any data that http://archive.mozilla.org/ doesn't have? If we publish buildhub2's database, isn't it essentially just a second http://archive.mozilla.org/, only with a different UI?
The main reason I'd like these is to make local development easier. While it's technically possible to build the local database by scraping archive.mozilla.org, it would take much longer than importing a database dump (even if we only scrape a small portion of archive.mozilla.org). It's very convenient to be able to blow away and rebuild the local database to test various changes and scenarios, and being able to do so quickly and easily is a big benefit.
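(For reference, importing a published dump into a local dev database is a short step. A sketch, assuming a plain-SQL dump compressed with gzip and a docker-compose setup; the service, user, and database names are guesses, not the actual Buildhub2 compose file.)

```python
# Hypothetical local import: stream a compressed dump into the dev database
# running under docker-compose. Service/user/database names are placeholders.
import gzip
import subprocess


def restore_local(dump_path="buildhub2-latest.sql.gz"):
    psql = subprocess.Popen(
        ["docker-compose", "exec", "-T", "db",
         "psql", "--quiet", "-U", "buildhub2", "-d", "buildhub2"],
        stdin=subprocess.PIPE,
    )
    with gzip.open(dump_path, "rb") as dump:
        # Decompress in chunks so the whole dump never sits in memory.
        for chunk in iter(lambda: dump.read(1 << 20), b""):
            psql.stdin.write(chunk)
    psql.stdin.close()
    if psql.wait() != 0:
        raise RuntimeError("psql import failed")


if __name__ == "__main__":
    restore_local()
```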
> - Like you mentioned, the buildhub2 database currently has no secrets in it. What I fear is that sometime in the future a secret is added to it, people forget that there is a db dump job running, and the secret gets leaked accidentally. That's one of the reasons I usually do not recommend dumping and publishing prod databases.
I totally understand this concern. Given that buildhub2 is really just an index of already-public data, I think that's pretty unlikely, though, unless its scope changes significantly.
> - If other services need the data from buildhub2, would it be possible for them to call a buildhub2 API to download it, rather than relying on its db dumps? It feels more future-proof, more secure, and more modern, IMHO.
I think this is a bit orthogonal, since the database dumps are intended to be used for local development (and possibly to do BigQuery imports, although we might want to do those in a more robust way).
Comment 4•6 years ago
(In reply to :wezhou from comment #2)
> - Like you mentioned, the buildhub2 database currently has no secrets in it. What I fear is that sometime in the future a secret is added to it, people forget that there is a db dump job running, and the secret gets leaked accidentally. That's one of the reasons I usually do not recommend dumping and publishing prod databases.
One mitigation here would be to dump only the specific tables related to the build information, rather than the entire db. It's not exactly the same situation, but when we made a read-only copy of the Treeherder database we ran a quick RRA to make sure there were no big security implications (see bug 1315398); we could do the same here if there are concerns.
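(A table-scoped dump is a single pg_dump flag. A sketch, where the table name "api_build" and the connection string are guesses that would need to be confirmed against the actual schema.)

```python
# Hypothetical table-scoped dump: export only tables that have been reviewed
# for public release, rather than the whole database. Both the connection
# string and the table name are placeholders, not confirmed values.
import subprocess

DB_URL = "postgresql://buildhub2@db.example.internal/buildhub2"  # placeholder

subprocess.run(
    ["pg_dump", "--no-owner",
     "--table", "api_build",                 # reviewed table(s) only
     "--file", "buildhub2-builds-only.sql",
     DB_URL],
    check=True,
)
```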
> - If other services need the data from buildhub2, would it be possible for them to call a buildhub2 API to download it, rather than relying on its db dumps? It feels more future-proof, more secure, and more modern, IMHO.
This is essentially what is happening now, but it's really not ideal and only works for specific applications. Mission Control, for example, uses the Elasticsearch interface to import a subset of the data into its db:
https://github.com/mozilla/missioncontrol/blob/master/missioncontrol/etl/builds.py
Writing the above was a significant amount of effort (not easily replicable across services), and this approach doesn't work for ad-hoc queries via sql.telemetry.mozilla.org or similar.
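(To make the "hand-rolled ETL per consumer" point concrete, a rough sketch of the kind of query each service currently has to write against Buildhub's search endpoint. The URL, query, and field names are illustrative assumptions, not the exact ones missioncontrol uses.)

```python
# Rough illustration of per-service ETL: every consumer hand-rolls its own
# Elasticsearch-style query against Buildhub's search endpoint.
import requests

SEARCH_URL = "https://buildhub.moz.tools/api/search"  # assumed endpoint

query = {
    "size": 100,
    "query": {"term": {"source.product": "firefox"}},
    "sort": [{"download.date": {"order": "desc"}}],
}

resp = requests.post(SEARCH_URL, json=query, timeout=30)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    record = hit["_source"]
    print(record["target"]["version"], record["build"]["id"])
```

(Each consumer repeating this kind of query-and-transform code is the effort a published dump, or a shared BigQuery table, would avoid.)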
Comment 5•6 years ago
Hi Ben,
> The main reason I'd like these is to make local development easier.
For your purpose, could we just do a one-time dump instead of daily dumps (and thus skip publishing)? Do you just need the latest schema, or do you also need the latest data?
Hi William,
Thank you for your input. Let's wait and see if secops has any questions or comments on this.
Reporter
Comment 6•6 years ago
(In reply to :wezhou from comment #5)
> Hi Ben,
> > The main reason I'd like these is to make local development easier.
> For your purpose, could we just do a one-time dump instead of daily dumps (and thus skip publishing)? Do you just need the latest schema, or do you also need the latest data?
Both are helpful. In other projects, I've had scenarios where I've needed up-to-date production data to reproduce issues, so a one-time dump doesn't help in those cases.
To expand a bit more, having published dumps available means that we can set up local development to bring up an entire environment with data just by running "docker-compose up", reset the database at will, etc. A publicly available dump also allows volunteers to contribute to the project more easily.
If it helps, this is something we already do for Balrog (https://storage.googleapis.com/balrog-prod-dbdump-v1/dump.sql.txt.xz)
Comment 7•5 years ago
We've been down the path of automated & sanitized db dumps to public locations for the purpose of local dev before. It didn't end well: https://blog.mozilla.org/security/2014/08/01/mdn-database-disclosure/
I second Wei's opinion here that a one-off dump is preferable. We shouldn't make this public, but only provide it to the devs who need it. I know this is inconvenient, but local dev really shouldn't rely on prod data; you should be able to mock or recreate the data. Debugging the occasional prod issue can use the one-off dump process as needed, but it shouldn't be automated.
With that being said, as William suggests, we could export specific tables & columns that have been reviewed, but we would need a process that guarantees the data in those exports never becomes sensitive, which is hard.
Reporter
Comment 8•5 years ago
Alright, I'll find another path forward here.
Comment 9•5 years ago
(In reply to Julien Vehent from comment #7)
> We've been down the path of automated & sanitized db dumps to public locations for the purpose of local dev before. It didn't end well: https://blog.mozilla.org/security/2014/08/01/mdn-database-disclosure/
> I second Wei's opinion here that a one-off dump is preferable. We shouldn't make this public, but only provide it to the devs who need it. I know this is inconvenient, but local dev really shouldn't rely on prod data; you should be able to mock or recreate the data. Debugging the occasional prod issue can use the one-off dump process as needed, but it shouldn't be automated.
> With that being said, as William suggests, we could export specific tables & columns that have been reviewed, but we would need a process that guarantees the data in those exports never becomes sensitive, which is hard.
Yeah, in retrospect I think this analysis makes sense. With respect to (only) making this data available in BigQuery, I think creating a standalone job which does this is the best approach; that way we're not conflating that task with the (legitimate) concerns around keeping a production service's db private. I filed bug 1607229 about that.