Limit access to lat/long fields of leanplum_sessions
Categories: Data Platform and Tools :: General, enhancement
Tracking: Not tracked
People: Reporter: frank; Assigned: frank
Whiteboard: [DataOps]
This is basically a bug to cover how we want to disable access to the lat and lon fields of leanplum_sessions_v1 (currently only available in firefox_android_nightly_external).
My initial take is to:

1. Create a new dataset, firefox_android_nightly_external_raw, which we use as the basis for the tables currently flowing in.
2. Lock down access to firefox_android_nightly_external_raw.leanplum_sessions_v1.
3. Create an authorized view on it in firefox_android_nightly_external, without those fields exposed. (I'd also create views on the rest of the tables there.)
Results of this approach:

- We would need {product}_{channel}_external_raw datasets for all the product/channels (good and bad: more datasets, but also more flexibility).
- We could limit other PII we get in the future from external datasets. This approach lets us handle PII separately from import, since touching the data while it's in-flight to e.g. drop columns is error-prone.

We could drop those columns in-flight, but I'd rather avoid that if possible. (I can't disable them in the API request.) A sketch of the view-based setup follows below.
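Here's that sketch, using the google-cloud-bigquery Python client. The dataset names mirror the proposal above, but the exact location column names (lat, lon) are assumptions, and the calls are illustrative rather than the final implementation:

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-shared-prod")

RAW = "moz-fx-data-shared-prod.firefox_android_nightly_external_raw"
USER_FACING = "moz-fx-data-shared-prod.firefox_android_nightly_external"

# Create a view in the user-facing dataset that hides the location
# columns of the locked-down raw table.
view = bigquery.Table(f"{USER_FACING}.leanplum_sessions_v1")
view.view_query = f"SELECT * EXCEPT (lat, lon) FROM `{RAW}.leanplum_sessions_v1`"
view = client.create_table(view)

# Authorize the view on the raw dataset, so users who can only read
# the user-facing dataset can still query through the view.
raw_dataset = client.get_dataset(RAW)
entries = list(raw_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role=None,
        entity_type="view",
        entity_id={
            "projectId": view.project,
            "datasetId": view.dataset_id,
            "tableId": view.table_id,
        },
    )
)
raw_dataset.access_entries = entries
client.update_dataset(raw_dataset, ["access_entries"])

Restricting who can read the raw dataset directly would then be a separate IAM change on firefox_android_nightly_external_raw.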
Comment 1 • 6 years ago
Currently we have an external data GCS bucket and some BigQuery datasets to house external tables into this bucket. It should be possible to define the columns explicitly when creating these external tables, e.g. with https://cloud.google.com/bigquery/external-table-definition#creating_a_table_definition_using_a_json_schema_file, which would limit access to those columns from BigQuery. Does this approach seem reasonable?
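For reference, a sketch of that idea with the google-cloud-bigquery Python client (a JSON table-definition file per the linked doc would be equivalent); the bucket path and the non-location column names are hypothetical, and lat/lon are simply left out of the explicit schema:

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-shared-prod")

# Explicit schema that omits the lat/lon columns; the other column
# names here are placeholders.
config = bigquery.ExternalConfig("CSV")
config.source_uris = ["gs://example-leanplum-export/sessions/*.csv"]
config.schema = [
    bigquery.SchemaField("session_id", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("start_time", "TIMESTAMP"),
]

table = bigquery.Table(
    "moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1"
)
table.external_data_configuration = config
client.create_table(table)

(As comment #4 below notes, this turns out not to work for CSV sources, since BigQuery requires the schema to cover every column in the file.)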
Assignee
Comment 2 • 6 years ago
(In reply to Wesley Dawson [:whd] from comment #1)
> Currently we have an external data GCS bucket and some BigQuery datasets to house external tables into this bucket. It should be possible to define the columns explicitly when creating these external tables, e.g. with https://cloud.google.com/bigquery/external-table-definition#creating_a_table_definition_using_a_json_schema_file, which would limit access to those columns from BigQuery. Does this approach seem reasonable?
The downside with this approach is that we can no longer rely on the automated schema inference for the rest of the columns in that dataset, and unfortunately Leanplum doesn't guarantee a schema of any kind (so we'd end up inferring it ourselves).
We could:

- Add the full table in BQ, with inferred schema.
- Get the schema of that table.
- Remove lat/lon from that schema.
- Recreate the table with the updated schema.

Basically a workaround (sketched below). If you'd rather I went this direction, that's fine.
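A rough sketch of those four steps with the Python client; the autodetected source table name is a placeholder, and lat/lon are the assumed field names:

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-shared-prod")

# Step 1: start from an external table created with schema
# autodetection (placeholder name).
full = client.get_table("moz-fx-data-shared-prod.tmp.leanplum_sessions_autodetect")

# Steps 2 and 3: take the inferred schema and drop the location fields.
reduced_schema = [f for f in full.schema if f.name not in ("lat", "lon")]

# Step 4: recreate the external table definition with the reduced
# schema, pointing at the same source files.
config = bigquery.ExternalConfig("CSV")
config.source_uris = full.external_data_configuration.source_uris
config.schema = reduced_schema

table = bigquery.Table(
    "moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1"
)
table.external_data_configuration = config
client.create_table(table)

(Comment #4 below explains why this ultimately fails for CSV sources.)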
Comment 3 • 6 years ago
(In reply to Frank Bertsch [:frank] from comment #2)
> Basically a workaround. If you'd rather I went this direction, that's fine.

I'd prefer this, assuming it's a one-time thing (i.e. Leanplum is not constantly changing their schema).
Assignee
Comment 4 • 6 years ago
(In reply to Wesley Dawson [:whd] from comment #3)
> (In reply to Frank Bertsch [:frank] from comment #2)
> > Basically a workaround. If you'd rather I went this direction, that's fine.
>
> I'd prefer this, assuming it's a one-time thing (i.e. Leanplum is not constantly changing their schema).

Unfortunately this won't work; we're going to have to go for another workaround.

EDIT: The issue is that you cannot create a schema that doesn't include all of the fields in the CSV file. When you try to query the table, you get this error:

Error while reading table: sandbox.test_data, error message: Too many values in row starting at position: 0.
One option could be to no longer have these be external tables and instead load the data into BQ. The process would be (sketched below):

- Create an external table.
- Add to a BQ-native table as SELECT * EXCEPT from that table.
- Drop the external table.
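A sketch of this flow with the Python client; the bucket path, staging table name, and lat/lon column names are assumptions:

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-shared-prod")

STAGING = "moz-fx-data-shared-prod.tmp.leanplum_sessions_v1_external"
NATIVE = "moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1"

# 1. Create the external table over the exported files, letting
#    BigQuery autodetect the full schema (lat/lon included).
config = bigquery.ExternalConfig("CSV")
config.source_uris = ["gs://example-leanplum-export/sessions/*.csv"]
config.autodetect = True
staging = bigquery.Table(STAGING)
staging.external_data_configuration = config
client.create_table(staging)

# 2. Materialize into a BQ-native table, dropping the location
#    fields on the way in.
job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(NATIVE),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    f"SELECT * EXCEPT (lat, lon) FROM `{STAGING}`",
    job_config=job_config,
).result()

# 3. Drop the external table; only the filtered native copy remains.
client.delete_table(STAGING)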
Assignee
Comment 5 • 6 years ago
Wesley, I laid out another alternative, but I would prefer what is mentioned in comment #1. That also aligns with what we're doing for telemetry data coming through the pipeline (views on top of raw data).
Comment 6 • 6 years ago
(In reply to Frank Bertsch [:frank] from comment #4)
> One option could be to no longer have these be external tables and instead load the data into BQ. The process would be:
> - Create an external table.
> - Add to a BQ-native table as SELECT * EXCEPT from that table.
> - Drop the external table.

I prefer this approach. Notably, this temporary table should be created in the tmp dataset, which is advertised as temporary space for loading parquet data on dtmo. The plan is to re-characterize that dataset as a staging area for datasets generally and to remove the broad read access that it currently has. In this manner we can align various types of loading that require staging without the proliferation of additional raw datasets.
Assignee
Comment 7 • 6 years ago
(In reply to Wesley Dawson [:whd] from comment #6)
> (In reply to Frank Bertsch [:frank] from comment #4)
> > One option could be to no longer have these be external tables and instead load the data into BQ. The process would be:
> > - Create an external table.
> > - Add to a BQ-native table as SELECT * EXCEPT from that table.
> > - Drop the external table.
>
> I prefer this approach. Notably, this temporary table should be created in the tmp dataset, which is advertised as temporary space for loading parquet data on dtmo. The plan is to re-characterize that dataset as a staging area for datasets generally and to remove the broad read access that it currently has. In this manner we can align various types of loading that require staging without the proliferation of additional raw datasets.
That should be fine. The downside here is schema changes, but I'll handle those if and when they happen.

Moving to Mobile, and I'll take this.
Assignee
Comment 8 • 6 years ago
This has been updated and merged in https://github.com/mozilla/leanplum-data-export/pull/7.

:whd, it looks like the job is failing because of permissions issues with the tmp dataset:

[2019-11-20 14:55:27,521] {logging_mixin.py:95} INFO - [2019-11-20 14:55:27,521] {pod_launcher.py:104} INFO - google.api_core.exceptions.Forbidden: 403 POST https://bigquery.googleapis.com/bigquery/v2/projects/moz-fx-data-shared-prod/datasets/tmp/tables: Access Denied: Dataset moz-fx-data-shared-prod:tmp: User does not have bigquery.tables.create permission for dataset moz-fx-data-shared-prod:tmp.

Would you be able to update those?
Comment 9 • 6 years ago
I've filed https://github.com/mozilla-services/cloudops-infra/pull/1635 and will update the bug when it has been merged and deployed.
Comment 10 • 6 years ago
https://github.com/mozilla-services/cloudops-infra/pull/1635 has been applied, and an Airflow rerun of the failed day succeeded.
Assignee
Comment 11 • 6 years ago
(In reply to Wesley Dawson [:whd] from comment #10)
> https://github.com/mozilla-services/cloudops-infra/pull/1635 has been applied, and an Airflow rerun of the failed day succeeded.

Perfect, let's close this out.