Closed Bug 1654558 Opened 2 years ago Closed 2 years ago

Implement one-off handling to restrict access to XFO/CSP error reports

Categories

(Data Platform and Tools :: General, enhancement, P1)

enhancement
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: klukas)

Details

Attachments

(2 files)

In the interests of unblocking Bug 1647825, which is needed to investigate the scope of a potential user-impacting problem, we can implement a one-off solution ahead of the more generalized support being developed in Bug 1651005.

Once the general solution is ready, we will migrate this data collection to use that solution.

As discussed earlier this week, this one-off solution will not include changes to the data platform architecture (in particular access to data via PubSub and payload_bytes_* tables) but will restrict access to the live and stable tables in BigQuery.

I've consulted Alicia Gray from Trust & Safety (thanks Alicia!) and she is OK with this as an interim solution.

This should be transparent from the client Telemetry viewpoint, changes will be handled in the data pipeline routing and provisioning.

The specific schema being addressed here:
https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/570

Christoph, who will need access to do analysis on this data?

Flags: needinfo?(ckerschb)
Points: --- → 2
Priority: -- → P1

Mark, thanks for the update, that sounds like a solid plan forward to me.

(In reply to Mark Reid [:mreid] from comment #1)

Christoph, who will need access to do analysis on this data?

I am not entirely sure how to answer this question. Generally no one outside the Security Engineering Team needs access to that data, but we could restrict it even further to some individuals if necessary.

Flags: needinfo?(ckerschb)

:whd and I are coordinating on getting all relevant infra changes deployed within the next few days. Basic steps:

  • Merge the new schema
  • :klukas stages a new m-p-s PR that duplicates the schema to a new namespace
  • :whd preps ops logic to set restricted access on the new datasets associated with the new namespace
  • :klukas stages a gcp-ingestion change that adds one-off logic for rerouting telemetry/xfocsp-error-report pings to the new restricted namespace

Then we'll coordinate on proper sequencing for merging and applying all the above changes.

(In reply to Christoph Kerschbaumer [:ckerschb] from comment #2)

Mark, thanks for the update, that sounds like a solid plan forward to me.

(In reply to Mark Reid [:mreid] from comment #1)

Christoph, who will need access to do analysis on this data?

I am not entirely sure how to answer this question. Generally no one outside the Security Engineering Team needs access to that data, but we could restrict it even further to some individuals if necessary.

We will need a list individuals so that we can restrict access on Redash. If you can point me to the team on people.mozilla.org or provide a list of emails that would be great.

Also will you need to be able to join against other telemetry tables as part of this analysis?

PRs are in place and I'm going to roll out the ingestion pieces of this presently, in the following order:

  1. Merge https://github.com/mozilla-services/cloudops-infra/pull/2337 and run the access-groups jenkins job
  2. Merge and build https://github.com/mozilla/gcp-ingestion/pull/1323, which will trigger a stage deploy
  3. Merge https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/581 and re-run MPS
  4. Verify stage deployment
  5. Proceed with production deployment

This will set up our infrastructure to ingest and route data, and :jason is working separately on the re:dash access pieces.

The above steps are complete. There is some additional nuance that I didn't consider WRT pipeline families and document rerouting that was made apparent during my verification of stage, and which was mostly resolved by https://github.com/mozilla-services/cloudops-infra/pull/2339. It is something to consider as we develop the general solution, since we want to avoid special cases and having 3 pbd tables for the same ping is probably not the way to go.

This nuance is also reflected in BQ schemas generation, and specifically that we have different schemas metadata based on the distinction "telemetry" vs. "not telemetry". When I sent a test ping to stage, it looked like the following:

{
    "additional_properties": "{\"type\":\"xfocsp-error-report\",\"id\":\"0cb9115c-471b-164b-9630-9fb75567eca8\",\"creationDate\":\"2020-07-10T11:42:30.579Z\",\"version\":4,\"application\":{\"architecture\":\"x86-64\",\"buildId\":\"20200710134107\",\"name\":\"Firefox\",\"version\":\"80.0a1\",\"displayVersion\":\"80.0a1\",\"vendor\":\"Mozilla\",\"platformVersion\":\"80.0a1\",\"xpcomAbi\":\"x86_64-gcc3\",\"channel\":\"default\"}}",
    "document_id": "2b9a803c-67a6-490d-aa8f-b8b7864e8279",
    "metadata": {
      "geo": {
        "city": "Sunnyvale",
        "country": "US",
        "db_version": "2020-07-20T15:47:44Z",
        "subdivision1": "CA",
        "subdivision2": null
      },
      "header": {
        "date": null,
        "dnt": null,
        "x_debug_id": "whd",
        "x_pingsender_version": null
      },
      "isp": {
        "db_version": "2020-07-20T14:48:19Z",
        "name": "Comcast Cable",
        "organization": "Comcast Cable"
      },
      "user_agent": {
        "browser": null,
        "os": null,
        "version": null
      }
    },
    "normalized_app_name": "Firefox",
    "normalized_channel": "Other",
    "normalized_country_code": "US",
    "normalized_os": null,
    "normalized_os_version": null,
    "payload": {
      "csp_header": "",
      "error_type": "xfo",
      "frame_hostname": "example.com",
      "frame_uri": "http://example.com/browser/dom/security/test/general/file_framing_error_pages.sjs",
      "top_hostname": "example.com",
      "top_uri": "http//example.com/browser/dom/security/test/general/file_framing_error_pages_xfo.html",
      "xfo_header": "deny"
    },
    "sample_id": null,
    "submission_timestamp": "2020-07-22 22:20:44.655699 UTC"
  }

This is probably not the representation we want, in particular wrt client_id for deletion purposes (which may require shredder changes anyway). However, since it appears structured metadata is a subset of telemetry metadata, I didn't consider this a blocking issue and it's something we can resolve for this special case relatively easily (by updating MSG to special case inject telemetry metadata for the rerouted namespace). We should make this change relatively soon however, and I've asked :amiyaguchi to prepare the MSG change so that I can apply it imminently.

As for ACLs, I have verified that the namespace tables (live, stable, decoded etc.) for the new xfocsp_error_report bq dataset family have restrictive ACLs applied and aren't e.g. available for general access via re:dash. I've also verified with the above payload that rerouting works as we expect. There's a compatibility user-facing dataset for xfocsp_error_report defined in the derived-datasets project that is re:dash accessible, but queries to that dataset will result in an error when trying to access tables because the underlying dataset is inaccessible to re:dash and the view isn't authorized.

In summary I believe we're currently able to ingest, reroute, and restrict access to this data and for the most part https://bugzilla.mozilla.org/show_bug.cgi?id=1647825 is unblocked, but we should apply the MSG change and redeploy schemas before ingesting data if possible (otherwise we'll have to backfill or delete data).

:amiyaguchi's MSG PR has been merged, but it doesn't address the actual issue I'm seeing which is that the validation document includes a client_id, but it's not specified in the schema. So we need to do one of the following:

  1. remove client_id from the sample document used for validation
    (separately this year :amiyaguchi is going to add some CI that tells us how much of the validation payload is left unspecified and therefore goes to additional_properties; this should be minimized for new schemas)

  2. add client_id to the schema

It's probably possible to glean whether this ping is supposed to have a client_id from data review/client code but in a cursory inspection I couldn't find it. With one or the other of the above changes I think we can close this out (EDIT: also the actual users list to grant access to per comment #4).

(In reply to Wesley Dawson [:whd] from comment #6)

:jason is working separately on the re:dash access pieces.

I've created a new redash group bigquery_xfocsp and data source Telemery XFOCSP (BigQuery). Only users in this redash group will have access to this data source. I've added myself to the group for testing purposes.

(In reply to Wesley Dawson [:whd] from comment #9)

:amiyaguchi's MSG PR has been merged, but it doesn't address the actual issue I'm seeing which is that the validation document includes a client_id, but it's not specified in the schema. So we need to do one of the following:

  1. remove client_id from the sample document used for validation
    (separately this year :amiyaguchi is going to add some CI that tells us how much of the validation payload is left unspecified and therefore goes to additional_properties; this should be minimized for new schemas)

This is the thing we should do - these pings should not include a client_id:
https://searchfox.org/mozilla-central/source/browser/actors/NetErrorParent.jsm#151

I filed a PR to remove the ID field but it develops that the id is actually document_id, which is likely being sent as part of the common ping format. :klukas is going to file a followup schemas PR to revert my change and add the fields that are currently unspecified under the assumption that the validation payload is accurate.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Ok :klukas PR is in at https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/583 (which is really unrelated to the blocking security aspects anyway) and will be deployed on the standard schemas cadence. I think we can call this work finished.

Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED

(In reply to Jason Thomas [:jason] from comment #10)

(In reply to Wesley Dawson [:whd] from comment #6)

:jason is working separately on the re:dash access pieces.

I've created a new redash group bigquery_xfocsp and data source Telemery XFOCSP (BigQuery). Only users in this redash group will have access to this data source. I've added myself to the group for testing purposes.

Can you add the following users to have access to the data as well, please?

{jhofmann,ckerschbaumer,tihuang}@mozilla.com

Thanks!

Flags: needinfo?(jthomas)

Done.

Flags: needinfo?(jthomas)
You need to log in before you can comment on or make changes to this bug.