Closed Bug 1593268 Opened 6 years ago Closed 6 years ago

Limit access to lat/long fields of leanplum_sessions

Categories

(Data Platform and Tools :: General, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: frank)

Details

(Whiteboard: [DataOps])

This is basically a bug to cover how we want to disable access to the lat and lon fields of leanplum_sessions_v1 (currently only available in firefox_android_nightly_external).

My initial take is to:

  0. Create a new dataset, firefox_android_nightly_external_raw, which we use as the basis for the tables currently flowing in
  1. Lock down access to firefox_android_nightly_external_raw.leanplum_sessions_v1
  2. Create an authorized view on it in firefox_android_nightly_external, without those fields exposed (sketched below). I'd also create views on the rest of the tables there.
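
Roughly, with the google-cloud-bigquery Python client, steps 1 and 2 could look like this (a sketch only; the project name is taken from later in this bug, and all other names are assumptions):

    # Sketch: create a view that omits lat/lon, then authorize it against the
    # locked-down raw dataset. All names here are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")

    # Step 2: the view exposes everything except the sensitive columns.
    view = bigquery.Table(
        "moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1"
    )
    view.view_query = """
        SELECT * EXCEPT (lat, lon)
        FROM `moz-fx-data-shared-prod.firefox_android_nightly_external_raw.leanplum_sessions_v1`
    """
    view = client.create_table(view)

    # Step 1: authorize the view against the raw dataset, so users querying
    # the view never need (or get) direct access to the raw table.
    raw = client.get_dataset("moz-fx-data-shared-prod.firefox_android_nightly_external_raw")
    entries = list(raw.access_entries)
    entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
    raw.access_entries = entries
    client.update_dataset(raw, ["access_entries"])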

Results of this approach:

  • We would need {product}_{channel}_external_raw datasets for all the product/channels (good and bad - more datasets, but also more flexibility)
  • We could limit other PII we get in the future from external datasets (this approach lets us handle PII separately from the import, since modifying the data in flight, e.g. to drop columns, is error-prone)

We could drop those columns in-flight, but I'd rather avoid that if possible. (I can't disable them in the API request).

Currently we have an external data GCS bucket and some BigQuery datasets to house external tables over this bucket. It should be possible to define the columns explicitly when creating these external tables, e.g. with https://cloud.google.com/bigquery/external-table-definition#creating_a_table_definition_using_a_json_schema_file, which would limit access to those columns from BigQuery. Does this approach seem reasonable?
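
In Python terms, that would be something like the sketch below (the bucket path and column list are made up; note that comment #4 later finds this approach fails for CSV sources):

    # Sketch: define the external table's schema explicitly instead of using
    # autodetection, leaving lat/lon out. Bucket path and columns are
    # hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")

    config = bigquery.ExternalConfig("CSV")
    config.source_uris = ["gs://example-external-bucket/leanplum/sessions/*.csv"]
    config.schema = [
        bigquery.SchemaField("session_id", "STRING"),
        bigquery.SchemaField("user_id", "STRING"),
        # ...every other Leanplum column, with lat/lon omitted...
    ]

    table = bigquery.Table(
        "moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1"
    )
    table.external_data_configuration = config
    client.create_table(table)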

(In reply to Wesley Dawson [:whd] from comment #1)

Currently we have an external data GCS bucket and some BigQuery datasets to house external tables over this bucket. It should be possible to define the columns explicitly when creating these external tables, e.g. with https://cloud.google.com/bigquery/external-table-definition#creating_a_table_definition_using_a_json_schema_file, which would limit access to those columns from BigQuery. Does this approach seem reasonable?

The downside with this approach is that we can no longer rely on the automated schema inference for the rest of the columns in that dataset, and unfortunately Leanplum doesn't guarantee a schema of any kind (so we'd end up inferring it ourselves).

We could:

  1. Add the full table in BQ, with inferred schema
  2. Get the schema of that table
  3. Remove lat/lon from that schema
  4. Recreate the table with the updated schema
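
A rough sketch of those four steps with the Python client (names are assumptions; as comment #4 finds, BigQuery ends up rejecting the trimmed schema for CSV sources):

    # Sketch of the infer-then-strip workaround. Table names are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")

    # Steps 1-2: read back the schema BigQuery inferred for the full table.
    full = client.get_table(
        "moz-fx-data-shared-prod.firefox_android_nightly_external_raw.leanplum_sessions_v1"
    )

    # Step 3: drop lat/lon from that schema.
    trimmed = [f for f in full.schema if f.name not in ("lat", "lon")]

    # Step 4: recreate the external table with the trimmed, explicit schema.
    config = full.external_data_configuration
    config.autodetect = False
    config.schema = trimmed
    client.delete_table(full)
    table = bigquery.Table(full.reference)
    table.external_data_configuration = config
    client.create_table(table)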

Basically a workaround. If you'd rather I went this direction, that's fine.

Flags: needinfo?(whd)

(In reply to Frank Bertsch [:frank] from comment #2)

Basically a workaround. If you'd rather I went this direction, that's fine.

I'd prefer this, assuming it's a one-time thing (i.e. leanplum is not constantly changing their schema).

Flags: needinfo?(whd)

(In reply to Wesley Dawson [:whd] from comment #3)

(In reply to Frank Bertsch [:frank] from comment #2)

Basically a workaround. If you'd rather I went this direction, that's fine.

I'd prefer this, assuming it's a one-time thing (i.e. leanplum is not constantly changing their schema).

Unfortunately this won't work; we're going to have to go for another workaround. EDIT: The issue is that you cannot create a schema that doesn't include all of the fields in the CSV file. When you try to query the table, you get this error: Error while reading table: sandbox.test_data, error message: Too many values in row starting at position: 0.

One workaround could be to no longer use external tables and instead load the data into BQ. The process would be:

  1. Create external table
  2. Add to BQ-native table as SELECT * EXCEPT from that table
  3. Drop external table
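
A sketch of those three steps (the staging location anticipates whd's tmp suggestion below; the GCS path is made up):

    # Sketch: stage the CSVs behind a throwaway external table, then load a
    # native table without lat/lon. Names and paths are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")

    # 1. Create the external table over the exported CSVs.
    ext_config = bigquery.ExternalConfig("CSV")
    ext_config.source_uris = ["gs://example-leanplum-export/sessions/*.csv"]
    ext_config.autodetect = True
    ext = bigquery.Table("moz-fx-data-shared-prod.tmp.leanplum_sessions_external")
    ext.external_data_configuration = ext_config
    client.create_table(ext)

    # 2. Load a BQ-native table, dropping lat/lon on the way in.
    client.query("""
        CREATE OR REPLACE TABLE
          `moz-fx-data-shared-prod.firefox_android_nightly_external.leanplum_sessions_v1`
        AS
        SELECT * EXCEPT (lat, lon)
        FROM `moz-fx-data-shared-prod.tmp.leanplum_sessions_external`
    """).result()

    # 3. Drop the external table.
    client.delete_table(ext)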

Wesley, I laid out another alternative but I would prefer what is mentioned in comment #1. That also aligns with what we're doing for telemetry data coming through the pipeline (views on top of raw data).

Flags: needinfo?(whd)

(In reply to Frank Bertsch [:frank] from comment #4)

One workaround could be to no longer use external tables and instead load the data into BQ. The process would be:

  1. Create external table
  2. Add to BQ-native table as SELECT * EXCEPT from that table
  3. Drop external table

I prefer this approach. Notably, this temporary table should be created in the tmp dataset, which is advertised as temporary space for loading parquet data on dtmo (docs.telemetry.mozilla.org). The plan is to re-characterize that dataset as a staging area for datasets generally and to remove the broad read access that it currently has. In this manner we can align the various types of loading that require staging without the proliferation of additional raw datasets.
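
Removing the broad read access would be along these lines (a sketch only; which grants currently exist on tmp is an assumption):

    # Sketch: strip broad reader grants from the tmp dataset while keeping
    # everything else. Which entries to drop is an assumption.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")
    tmp = client.get_dataset("moz-fx-data-shared-prod.tmp")

    tmp.access_entries = [
        e for e in tmp.access_entries
        if not (e.role == "READER" and e.entity_type == "specialGroup")
    ]
    client.update_dataset(tmp, ["access_entries"])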

Flags: needinfo?(whd)
Assignee: nobody → whd
Whiteboard: [DataOps]

(In reply to Wesley Dawson [:whd] from comment #6)

(In reply to Frank Bertsch [:frank] from comment #4)

One could be to no longer have these be external tables and instead load the data into BQ. Process would be:

  1. Create external table
  2. Add to BQ-native table as SELECT * EXCEPT from that table
  3. Drop external table

I prefer this approach. Notably, this temporary table should be created in the tmp dataset, which is advertised as temporary space for loading parquet data on dtmo (docs.telemetry.mozilla.org). The plan is to re-characterize that dataset as a staging area for datasets generally and to remove the broad read access that it currently has. In this manner we can align the various types of loading that require staging without the proliferation of additional raw datasets.

That should be fine. The downside here is schema changes, but I'll handle those if and when they happen.

Moving to Mobile and I'll take this.

Assignee: whd → fbertsch
Component: Operations → Datasets: Mobile
QA Contact: moconnor

This has been updated and merged in https://github.com/mozilla/leanplum-data-export/pull/7.

:whd, it looks like the job is failing because of permissions issues with the tmp dataset:

[2019-11-20 14:55:27,521] {logging_mixin.py:95} INFO - [2019-11-20 14:55:27,521] {pod_launcher.py:104} INFO - google.api_core.exceptions.Forbidden: 403 POST https://bigquery.googleapis.com/bigquery/v2/projects/moz-fx-data-shared-prod/datasets/tmp/tables: Access Denied: Dataset moz-fx-data-shared-prod:tmp: User does not have bigquery.tables.create permission for dataset moz-fx-data-shared-prod:tmp.

Would you be able to update those?
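
For reference, the missing grant amounts to something like the sketch below (the actual fix landed in cloudops-infra, per the next comment; the service-account address here is hypothetical):

    # Sketch: give the export job's service account write access to tmp so it
    # can create tables there. The account address is hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client(project="moz-fx-data-shared-prod")
    tmp = client.get_dataset("moz-fx-data-shared-prod.tmp")

    entries = list(tmp.access_entries)
    entries.append(
        bigquery.AccessEntry(
            "WRITER",
            "userByEmail",
            "leanplum-export@moz-fx-data-shared-prod.iam.gserviceaccount.com",
        )
    )
    tmp.access_entries = entries
    client.update_dataset(tmp, ["access_entries"])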

Flags: needinfo?(whd)

I've filed https://github.com/mozilla-services/cloudops-infra/pull/1635 and will update the bug when it has been merged and deployed.

Flags: needinfo?(whd)

https://github.com/mozilla-services/cloudops-infra/pull/1635 has been applied, and an airflow rerun of the failed day succeeded.

(In reply to Wesley Dawson [:whd] from comment #10)

https://github.com/mozilla-services/cloudops-infra/pull/1635 has been applied, and an airflow rerun of the failed day succeeded.

Perfect, let's close this out.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Datasets: Mobile → General