Closed Bug 1272388 Opened 9 years ago Closed 6 years ago

Allow databricks clusters to access the tiles redshift database

Categories

(Data Platform and Tools Graveyard :: Operations, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emtwo, Assigned: robotblake)

References

Details

(Whiteboard: [DataOps])

This includes setting up a JDBC driver and perhaps installing this library: https://github.com/databricks/spark-redshift
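For illustration, a minimal read through the spark-redshift connector could look like the sketch below. The host, database, credentials, and S3 staging directory are all placeholders, and the `spark.read` call itself is shown as notebook-style comments since it only runs inside a Spark session:

```python
# Hypothetical sketch of connecting to Redshift from Spark via JDBC.
# All connection details below are placeholders, not real infrastructure.

def redshift_jdbc_url(host, port, database, user, password):
    """Build a Redshift JDBC URL in the form the connector expects."""
    return (f"jdbc:redshift://{host}:{port}/{database}"
            f"?user={user}&password={password}")

url = redshift_jdbc_url("tiles.example.us-west-2.redshift.amazonaws.com",
                        5439, "tiles", "databricks_ro", "REDACTED")

# Inside a Databricks notebook the read itself would be roughly:
# df = (spark.read
#       .format("com.databricks.spark.redshift")
#       .option("url", url)
#       .option("dbtable", "ping_centre_main")
#       .option("tempdir", "s3a://example-bucket/redshift-staging/")  # S3 staging dir the connector requires
#       .load())
```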
Depends on: 1270630
Points: --- → 2
Priority: -- → P2
Whiteboard: [SvcOps]
Blocks: 1291340
Is this still needed, Marina?
Component: Metrics: Pipeline → Spark
Flags: needinfo?(msamuel)
Product: Cloud Services → Data Platform and Tools
It's a "nice to have" but not high priority.
Flags: needinfo?(msamuel)
Whiteboard: [SvcOps] → [DataOps]
Assignee: nobody → bimsland
Priority: P2 → P3
We're looking to wire this up to Databricks (but not ATMO). Steps would be as follows:

1. Create a databricks user in redshift with permissions for SELECT from the needed tables.
2. Create an internal NLB in the prod IAM with a target group pointing to the tiles redshift cluster.
3. Create an endpoint service in the prod IAM with a target of the internal NLB.
4. Create an endpoint in the databricks IAM with a target of the created endpoint service.
5. Approve the endpoint request in the prod IAM.
6. Test that connecting to the endpoint from databricks with the new credentials works as expected.

The main reason to use an endpoint service is that it lets us tunnel a particular port across AWS's internal network, as opposed to setting up and maintaining a NAT that goes over the internet.
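The networking steps above might translate to AWS CLI calls roughly like the following. Every name, ID, and ARN here is a placeholder, so treat this as an illustrative outline of the sequence rather than runnable provisioning:

```shell
# 2. Internal NLB with a target group pointing at the Redshift cluster (port 5439)
aws elbv2 create-target-group --name tiles-redshift-tg --protocol TCP \
    --port 5439 --vpc-id vpc-PRODEXAMPLE --target-type ip
aws elbv2 create-load-balancer --name tiles-redshift-nlb --type network \
    --scheme internal --subnets subnet-EXAMPLE1 subnet-EXAMPLE2

# 3. Endpoint service in the prod account fronting the NLB
aws ec2 create-vpc-endpoint-service-configuration \
    --network-load-balancer-arns arn:aws:elasticloadbalancing:...:loadbalancer/net/tiles-redshift-nlb/EXAMPLE \
    --acceptance-required

# 4. Interface endpoint in the databricks account targeting that service
aws ec2 create-vpc-endpoint --vpc-endpoint-type Interface \
    --vpc-id vpc-DATABRICKSEXAMPLE \
    --service-name com.amazonaws.vpce.us-west-2.vpce-svc-EXAMPLE \
    --subnet-ids subnet-EXAMPLE3

# 5. Approve the connection request back in the prod account
aws ec2 accept-vpc-endpoint-connections \
    --service-id vpce-svc-EXAMPLE --vpc-endpoint-ids vpce-EXAMPLE
```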
Summary: Allow spark cluster instances to access redshift databases → Allow spark cluster instances to access the tiles redshift database
Summary: Allow spark cluster instances to access the tiles redshift database → Allow databricks clusters to access the tiles redshift database
:bmiroglio - What tables do you need access to in Databricks? :ulfr - Can I get an r? from secops?
Flags: needinfo?(jvehent)
Flags: needinfo?(bmiroglio)
Flags: needinfo?(bmiroglio)
For now, I need access to ping_centre_main
I don't have operational experience with vpc endpoint services, but the documented security properties sound good to me, so r+ (but please do call out any oddities you find while setting it up so we can learn).
Flags: needinfo?(jvehent)
:robotblake - Can you set up the VPC endpoint service? I'll create a new user account with access to the ping_centre_main table.
Flags: needinfo?(bimsland)
Component: Spark → Operations
Flags: needinfo?(bimsland)
Priority: P3 → P1
QA Contact: moconnor
The endpoint service has been created.
Credentials sent
I'd like to request access to the `assa_router_events_daily` table. :jbuck can you add permissions to this table?
Flags: needinfo?(jbuckley)
Done
Flags: needinfo?(jbuckley)
Worked on wiring everything up on Friday but ran into some networking issues; working to resolve those now.
Alright, after fighting with this for another day I've opened a support request with AWS. I can get the endpoint service working when connected to the data IAM but not the databricks IAM; will update here when I hear from them.
Got the connections working, will throw up an example notebook and forward along.
:jbuck can you additionally add permissions to the `firefox_onboarding_events_daily` and `firefox_onboarding_events2_daily` tables?
Flags: needinfo?(jbuckley)
There's an example notebook using pyspark here: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/37716, let me know if you have any issues!
(In reply to Ben Miroglio [:bmiroglio] from comment #15)
> :jbuck can you additionally add permissions to the
> `firefox_onboarding_events_daily` and `firefox_onboarding_events2_daily`
> tables?

Done
Flags: needinfo?(jbuckley)
I'm going to close this but if you run into any other issues feel free to open another bug!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
(from slack) jbuck added access to `assa_events_daily` in addition to the tables mentioned above.

Hi, it seems that this workaround for accessing tables in the Tiles DB isn't working anymore (was working previously).

seeing this in the errors: Caused by: org.postgresql.util.PSQLException: ERROR: relation "assa_impresson_stats_daily" does not exist

I know there was a reset of AWS credentials recently, is this related to it?

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

:shong - I think that's just a typo in the table name, missing the i in impression. Can you fix that and confirm?

Flags: needinfo?(shong)

Hi Jon,

Yes, you're right. I had a typo in that example, sorry about that!

So I'm able to access the tiles DBs using this methodology, but I'm running into issues when the data size gets moderately large (small enough that spark should normally be able to handle it easily). Example below:

I'm able to work around it for now by limiting my columns, but this could become a limiting issue sooner rather than later (a number of things related to Ridgeline, like snippets data, are contained in Tiles). We might need a long-term solution that gives Databricks direct access to the database, as we do with the regular Telemetry data.
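One common mitigation for JDBC reads that fall over at moderate sizes is Spark's partitioned read (`partitionColumn`/`lowerBound`/`upperBound`/`numPartitions`), which splits the query into parallel range scans instead of one oversized result set. The helper below only illustrates how that range splitting works; the column name and bounds are hypothetical, and the notebook read is shown in comments:

```python
# Illustration of how Spark-style JDBC partitioning splits a key range
# into per-partition scans; values here are hypothetical.

def jdbc_partition_ranges(lower, upper, num_partitions):
    """Split [lower, upper) into num_partitions contiguous ranges,
    similar to what Spark does with lowerBound/upperBound."""
    stride = (upper - lower) // num_partitions
    bounds = [lower + i * stride for i in range(num_partitions)] + [upper]
    return list(zip(bounds, bounds[1:]))

# Four partitions over ids 0..1000; each tuple corresponds to a
# WHERE id >= lo AND id < hi predicate run by one task.
ranges = jdbc_partition_ranges(0, 1000, 4)

# The corresponding notebook read would be roughly:
# df = (spark.read.format("jdbc")
#       .option("url", url)                # JDBC URL from the earlier setup
#       .option("dbtable", "assa_router_events_daily")
#       .option("partitionColumn", "id")   # hypothetical numeric column
#       .option("lowerBound", "0")
#       .option("upperBound", "1000")
#       .option("numPartitions", "4")
#       .load())
```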

Thank you!

Flags: needinfo?(shong)

The real solution here is probably to migrate all of this data into BigQuery, :jbuck is that something we're looking at doing at some point?

Flags: needinfo?(jbuckley)

(In reply to Blake Imsland [:robotblake] from comment #23)
> The real solution here is probably to migrate all of this data into
> BigQuery, :jbuck is that something we're looking at doing at some point?

Yes, we've already migrated part of the AS telemetry to BQ, the rest will be done shortly in Q3&Q4.

Flags: needinfo?(jbuckley)

Excellent! 👏

:shong - Let's work through this issue with Redshift in a new bug. I tried accessing that notebook, but it says I don't have View permissions on it.

Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

Excellent. Thank you everyone.

Yes, I see that it's restricted, oversight on my part.

Update: the Tiles access methodology provided by blake in comment 16 has been deprecated (the password will not work).

Use the new method using dbutils secrets provided below going forward:
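As a hedged sketch of the dbutils-secrets approach (the scope and key names and the host are placeholders, not the real ones), credentials can be pulled from a Databricks secret scope and assembled into connector options. In a notebook you would pass `dbutils.secrets.get` as the getter so credentials never appear in notebook source:

```python
# Sketch of building Redshift connection options from Databricks secrets.
# Scope/key names and the host are placeholders, not the real values.

def redshift_options(secrets_get):
    """Assemble spark-redshift options, given a secrets getter.

    In a Databricks notebook, pass dbutils.secrets.get as secrets_get.
    """
    user = secrets_get(scope="tiles-redshift", key="user")
    password = secrets_get(scope="tiles-redshift", key="password")
    return {
        "url": ("jdbc:redshift://tiles.example:5439/tiles"
                f"?user={user}&password={password}"),
        "forward_spark_s3_credentials": "true",
    }

# Notebook usage would be roughly:
# opts = redshift_options(dbutils.secrets.get)
# df = (spark.read.format("com.databricks.spark.redshift")
#       .options(**opts)
#       .option("dbtable", "ping_centre_main")
#       .option("tempdir", "s3a://example-bucket/redshift-staging/")
#       .load())
```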

Product: Data Platform and Tools → Data Platform and Tools Graveyard