Closed Bug 1593268 Opened 6 years ago Closed 6 years ago

Limit access to lat/long fields of leanplum_sessions

Categories

(Data Platform and Tools :: General, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: frank)

Details

(Whiteboard: [DataOps])

This is basically a bug to cover how we want to disable access to the lat and lon fields of leanplum_sessions_v1 (currently only available in firefox_android_nightly_external).

My initial take is to:

  0. Create a new dataset, firefox_android_nightly_external_raw, which we use as the basis for the tables currently flowing in
  1. Lock down access to firefox_android_nightly_external_raw.leanplum_sessions_v1
  2. Create an authorized view on it in firefox_android_nightly_external, without those fields exposed (sketched below). I'd also create views on the rest of the tables there.
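
Roughly, with the google-cloud-bigquery Python client, steps 1 and 2 could look like this (a sketch only; the project name is taken from later in this bug, and all other names are assumptions):

    # Sketch: create a view that omits lat/lon, then authorize it against the
    # locked-down raw dataset. All names here are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")

    # Step 2: the view exposes everything except the sensitive columns.
    view = bigquery.Table(
        "moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1"
    )
    view.view_query = """
        SELECT * EXCEPT (lat, lon)
        FROM `moz-fx-data-shared-prod.firefox_android_nightly_external_raw.leanplum_sessions_v1`
    """
    view = client.create_table(view)

    # Step 1: authorize the view against the raw dataset, so users querying
    # the view never need (or get) direct access to the raw table.
    raw = client.get_dataset("moz-fx-data-shared-prod.firefox_android_nightly_external_raw")
    entries = list(raw.access_entries)
    entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
    raw.access_entries = entries
    client.update_dataset(raw, ["access_entries"])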

Results of this approach:

  • We would need {product}_{channel}_external_raw datasets for all the product/channels (good and bad - more datasets, but also more flexibility)
  • We could limit other PII we get in the future from external datasets (this approach lets us handle PII separately from the import, since modifying the data in flight, e.g. to drop columns, is error-prone)

We could drop those columns in-flight, but I'd rather avoid that if possible. (I can't disable them in the API request).

Currently we have an external data GCS bucket and some BigQuery datasets to house external tables over this bucket. It should be possible to define the columns explicitly when creating these external tables, e.g. with https://cloud.google.com/bigquery/external-table-definition#creating_a_table_definition_using_a_json_schema_file, which would limit access to those columns from BigQuery. Does this approach seem reasonable?
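
In Python terms, that would be something like the sketch below (the bucket path and column list are made up; note that comment #4 later finds this approach fails for CSV sources):

    # Sketch: define the external table's schema explicitly instead of using
    # autodetection, leaving lat/lon out. Bucket path and columns are
    # hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")

    config = bigquery.ExternalConfig("CSV")
    config.source_uris = ["gs://example-external-bucket/leanplum/sessions/*.csv"]
    config.schema = [
        bigquery.SchemaField("session_id", "STRING"),
        bigquery.SchemaField("user_id", "STRING"),
        # ...every other Leanplum column, with lat/lon omitted...
    ]

    table = bigquery.Table(
        "moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1"
    )
    table.external_data_configuration = config
    client.create_table(table)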

(In reply to Wesley Dawson [:whd] from comment #1)

Currently we have an external data GCS bucket and some BigQuery datasets to house external tables over this bucket. It should be possible to define the columns explicitly when creating these external tables, e.g. with https://cloud.google.com/bigquery/external-table-definition#creating_a_table_definition_using_a_json_schema_file, which would limit access to those columns from BigQuery. Does this approach seem reasonable?

The downside with this approach is that we can no longer rely on the automated schema inference for the rest of the columns in that dataset, and unfortunately Leanplum doesn't guarantee a schema of any kind (so we'd end up inferring it ourselves).

We could:

  1. Add the full table in BQ, with inferred schema
  2. Get the schema of that table
  3. Remove lat/lon from that schema
  4. Recreate the table with the updated schema
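
A rough sketch of those four steps with the Python client (names are assumptions; as comment #4 finds, BigQuery ends up rejecting the trimmed schema for CSV sources):

    # Sketch of the infer-then-strip workaround. Table names are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")

    # Steps 1-2: read back the schema BigQuery inferred for the full table.
    full = client.get_table(
        "moz-fx-data-shared-prod.firefox_android_nightly_external_raw.leanplum_sessions_v1"
    )

    # Step 3: drop lat/lon from that schema.
    trimmed = [f for f in full.schema if f.name not in ("lat", "lon")]

    # Step 4: recreate the external table with the trimmed, explicit schema.
    config = full.external_data_configuration
    config.autodetect = False
    config.schema = trimmed
    client.delete_table(full)
    table = bigquery.Table(full.reference)
    table.external_data_configuration = config
    client.create_table(table)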

Basically a workaround. If you'd rather I went this direction, that's fine.

Flags: needinfo?(whd)

(In reply to Frank Bertsch [:frank] from comment #2)

Basically a workaround. If you'd rather I went this direction, that's fine.

I'd prefer this, assuming it's a one-time thing (i.e. leanplum is not constantly changing their schema).

Flags: needinfo?(whd)

(In reply to Wesley Dawson [:whd] from comment #3)

(In reply to Frank Bertsch [:frank] from comment #2)

Basically a workaround. If you'd rather I went this direction, that's fine.

I'd prefer this, assuming it's a one-time thing (i.e. leanplum is not constantly changing their schema).

Unfortunately this won't work; we're going to have to go for another workaround. EDIT: The issue is that you cannot create a schema that doesn't include all of the fields in the CSV file. When you try to query the table, you get this error: Error while reading table: sandbox.test_data, error message: Too many values in row starting at position: 0.

One workaround could be to no longer use external tables and instead load the data into BQ. The process would be:

  1. Create external table
  2. Add to BQ-native table as SELECT * EXCEPT from that table
  3. Drop external table
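
A sketch of those three steps (the staging location anticipates whd's tmp suggestion below; the GCS path is made up):

    # Sketch: stage the CSVs behind a throwaway external table, then load a
    # native table without lat/lon. Names and paths are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")

    # 1. Create the external table over the exported CSVs.
    ext_config = bigquery.ExternalConfig("CSV")
    ext_config.source_uris = ["gs://example-leanplum-export/sessions/*.csv"]
    ext_config.autodetect = True
    ext = bigquery.Table("moz-fx-data-shared-prod.tmp.leanplum_sessions_external")
    ext.external_data_configuration = ext_config
    client.create_table(ext)

    # 2. Load a BQ-native table, dropping lat/lon on the way in.
    client.query("""
        CREATE OR REPLACE TABLE
          `moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1`
        AS
        SELECT * EXCEPT (lat, lon)
        FROM `moz-fx-data-shared-prod.tmp.leanplum_sessions_external`
    """).result()

    # 3. Drop the external table.
    client.delete_table(ext)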

Wesley, I laid out another alternative but I would prefer what is mentioned in comment #1. That also aligns with what we're doing for telemetry data coming through the pipeline (views on top of raw data).

Flags: needinfo?(whd)

(In reply to Frank Bertsch [:frank] from comment #4)

One workaround could be to no longer use external tables and instead load the data into BQ. The process would be:

  1. Create external table
  2. Add to BQ-native table as SELECT * EXCEPT from that table
  3. Drop external table

I prefer this approach. Notably, this temporary table should be created in the tmp dataset, which is advertised as temporary space for loading parquet data on dtmo (docs.telemetry.mozilla.org). The plan is to re-characterize that dataset as a staging area for datasets generally and to remove the broad read access that it currently has. In this manner we can align the various types of loading that require staging without the proliferation of additional raw datasets.
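
Removing the broad read access would be along these lines (a sketch only; which grants currently exist on tmp is an assumption):

    # Sketch: strip broad reader grants from the tmp dataset while keeping
    # everything else. Which entries to drop is an assumption.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")
    tmp = client.get_dataset("moz-fx-data-shared-prod.tmp")

    tmp.access_entries = [
        e for e in tmp.access_entries
        if not (e.role == "READER" and e.entity_type == "specialGroup")
    ]
    client.update_dataset(tmp, ["access_entries"])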

Flags: needinfo?(whd)
Assignee: nobody → whd
Whiteboard: [DataOps]

(In reply to Wesley Dawson [:whd] from comment #6)

(In reply to Frank Bertsch [:frank] from comment #4)

One could be to no longer have these be external tables and instead load the data into BQ. Process would be:

  1. Create external table
  2. Add to BQ-native table as SELECT * EXCEPT from that table
  3. Drop external table

I prefer this approach. Notably, this temporary table should be created in the tmp dataset, which is advertised as temporary space for loading parquet data on dtmo (docs.telemetry.mozilla.org). The plan is to re-characterize that dataset as a staging area for datasets generally and to remove the broad read access that it currently has. In this manner we can align the various types of loading that require staging without the proliferation of additional raw datasets.

That should be fine. The downside here is schema changes, but I'll handle those if and when they happen.

Moving to Mobile and I'll take this.

Assignee: whd → fbertsch
Component: Operations → Datasets: Mobile
QA Contact: moconnor

This has been updated and merged in https://github.com/mozilla/leanplum-data-export/pull/7.

:whd, it looks like the job is failing because of permissions issues with the tmp dataset:

[2019-11-20 14:55:27,521] {logging_mixin.py:95} INFO - [2019-11-20 14:55:27,521] {pod_launcher.py:104} INFO - google.api_core.exceptions.Forbidden: 403 POST https://bigquery.googleapis.com/bigquery/v2/projects/moz-fx-data-shared-prod/datasets/tmp/tables: Access Denied: Dataset moz-fx-data-shared-prod:tmp: User does not have bigquery.tables.create permission for dataset moz-fx-data-shared-prod:tmp.

Would you be able to update those?
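
For reference, the missing grant amounts to something like the sketch below (the actual fix landed in cloudops-infra, per the next comment; the service-account address here is hypothetical):

    # Sketch: give the export job's service account write access to tmp so it
    # can create tables there. The account address is hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")
    tmp = client.get_dataset("moz-fx-data-shared-prod.tmp")

    entries = list(tmp.access_entries)
    entries.append(
        bigquery.AccessEntry(
            "WRITER",
            "userByEmail",
            "leanplum-export@moz-fx-data-shared-prod.iam.gserviceaccount.com",
        )
    )
    tmp.access_entries = entries
    client.update_dataset(tmp, ["access_entries"])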

Flags: needinfo?(whd)

I've filed https://github.com/mozilla-services/cloudops-infra/pull/1635 and will update the bug when it has been merged and deployed.

Flags: needinfo?(whd)

https://github.com/mozilla-services/cloudops-infra/pull/1635 has been applied, and an airflow rerun of the failed day succeeded.

(In reply to Wesley Dawson [:whd] from comment #10)

https://github.com/mozilla-services/cloudops-infra/pull/1635 has been applied, and an airflow rerun of the failed day succeeded.

Perfect, let's close this out.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Datasets: Mobile → General