Closed Bug 1653654 Opened 4 years ago Closed 4 years ago

Request for bigquery job to forward AET identifier to AET pipeline from FxA oauth token created events

Categories: Data Platform and Tools :: General, enhancement, P2
Points: 3
Tracking: Not tracked
Status: RESOLVED FIXED

People: (Reporter: davejustishh, Assigned: klukas)

Attachments: (3 files)

Hello! We've added an AET identifier (ecosystem_anon_id) to the fxa_activity - oauth_access_token_created event.

We need a bigquery-etl job to be set up to find this event in the fxa-auth-server logs and forward it to the AET endpoint using the AET event schema.

Slack message from Jared Hirsch:

I think we decided the ecosystem_client_id could be null for server events, so maybe the only thing they need to relay is the eco_anon_id from the server. That’s probably the open question in the bug - do we need anything else in this event?

We've already gotten approval from trust. Here is the pr with the change on our side https://github.com/mozilla/fxa/pull/5929

Are there any particular time constraints for getting this pipeline set up?

Running a query that sends data to the pipeline is not a concept that currently exists, and I'm not sure it's the best way to approach this. If possible, I'd prefer for FxA servers to handle making the requests to the telemetry edge, or we could discuss provisioning a Pub/Sub topic that the servers can publish to. There's overlap here with design discussions I've had in the past with :_6a68 in regards to sending other FxA metrics through the pipeline.

I'd like to discuss with :_6a68 and :rfkelly to see if there's any previous context on this that I'm forgetting and to think through any potential risk scenarios.

I'd like to discuss with :_6a68 and :rfkelly to see if there's any previous context on this that I'm forgetting and to think through any potential risk scenarios.

I don't think we ever worked out a plan for AET ingestion in the metrics pipeline planning sessions earlier this year.

Submitting AET events via the ingestion edge is a straightforward task from the FxA side. We can definitely do that instead.

Using the Node PubSub library, sending directly to Pub/Sub would look roughly like:

    // We'll provide the relevant topic name, provisioned in a data ops project
    const topicName = ...;

    // JSON payload, preferably gzipped
    const dataBuffer = ...;

    // Add attributes to the message; normally, the telemetry edge would have
    // created these based on the HTTP request and server time
    const customAttributes = {
      uri: '/submit/firefox-accounts/account-ecosystem/1/2f069936-5e43-4cb1-901a-4e0f14fa6b51', // inject a different random UUID here for each message
      submission_timestamp: '2020-07-07 00:00:30.419191 UTC', // inject current server time
      user_agent: 'FxA Server', // nothing depends on this field, but it could be useful for debugging
    };

    const messageId = await pubSubClient
      .topic(topicName)
      .publish(dataBuffer, customAttributes);

Based on one of the examples

(In reply to Jared Hirsch [:_6a68] [:jhirsch] (Needinfo please) from comment #2)

Submitting AET events via the ingestion edge is a straightforward task from the FxA side. We can definitely do that instead.

I advocated against this in Berlin, because for server-side applications Mozilla controls it makes more sense to me to just use pubsub directly.

I don't actually remember what's supposed to go into this firefox-accounts pipeline family from https://bugzilla.mozilla.org/show_bug.cgi?id=1619020 but I think my preferred approach here would be to add an -aet branch to this family (which was already provisioned in stage) and give fxa servers access to write to that.

Some advantages of skipping the edge:

  • reduced costs
  • a reduction in the number of hops makes successful delivery more likely in fewer attempts, and edge QoS/load is not impacted by fxa server-side requests
  • avoiding edge issues like https://bugzilla.mozilla.org/show_bug.cgi?id=1652789 or issues caused by edge outage
    • in an edge outage situation, fxa servers must implement retry logic (hopefully with exponential backoff) similar to standard telemetry clients
    • counterpoint: in a pubsub outage, the edge is actually designed to be resilient to pubsub failure and will accept messages and queue locally until pubsub is back online; fxa servers would be expected to retry the pubsub API
  • additionally (and this may go away), depending on where fxa lives (probably still AWS), the edge does cross-cloud routing over HTTPS, which we should avoid if possible
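
The retry expectation above can be sketched as a small wrapper around the publish call. This is only an illustrative helper, not FxA's actual implementation; the `publishFn` callback, attempt count, and delay values are all assumptions:

```javascript
// Retry an async operation with exponential backoff, as fxa servers would
// need to do around pubSubClient.topic(...).publish(...) during a Pub/Sub
// outage. maxAttempts and baseDelayMs are illustrative, not a recommendation.
async function publishWithRetry(publishFn, maxAttempts = 5, baseDelayMs = 100) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await publishFn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: baseDelayMs, 2x, 4x, ... before the next attempt
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

In production you would also want jitter on the delay and a cap on total wait time, but the shape of the loop is the same.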

Semantically, I consider servers managed by Mozilla to be a separate class of application from clients that use the ingestion edge endpoint because they can take advantage of optimizations like publishing directly to pubsub. However, this distinction is also somewhat arbitrary and if FxA prefers the POST API with the caveats above, that's fine by me.

I'd also like :jbuck or :jrgm's opinion here if they have one.

My preferred solution here would be tailing Stackdriver Logs into a pubsub queue that the metrics pipeline consumes.

The application is already logging this data, and it's very easy to set up a log sink to pubsub. The only thing that might be annoying would be rewriting the data payload to match the format that telemetry is expecting, but that should just be some JSON parsing/stringifying and sending into another pubsub queue.

My preferred solution here would be tailing Stackdriver Logs into a pubsub queue that the metrics pipeline consumes.

It looks like we're going with this approach. This means the integration point between fxa infra and data infra will be a pubsub topic. The cloud function or similar ETL performing the logic in https://bugzilla.mozilla.org/show_bug.cgi?id=1653654#c3 will live fxa-side, and will publish data to a pubsub topic provisioned data-side.

The stage topic will initially be projects/moz-fx-data-shar-nonprod-efed/topics/structured-aet (which the http edge is a frontend for), but may change to projects/moz-fx-data-shar-nonprod-efed/topics/firefox_accounts-aet depending on conversations in August about the priority of https://bugzilla.mozilla.org/show_bug.cgi?id=1619020 and related work. :jbuck will supply me with the service account to grant publish access to when it's provisioned, after which the stage environment should be fully operational.

The stage service account is fxa-stage-aet@moz-fx-fxa-nonprod-375e.iam.gserviceaccount.com

I've added publish permissions for that SA to projects/moz-fx-data-shar-nonprod-efed/topics/structured-aet in my WIP branch https://github.com/mozilla-services/cloudops-infra/pull/2229/, so that SA should now be able to publish messages.

Summary: Request for bigquery job to forward AET identifier to AET pipeline from fxa-auth-server logs → Request for bigquery job to forward AET identifier to AET pipeline from FxA oauth token created events

I just had a long discussion with :whd about possible approaches to this problem of routing AET-related FxA Stackdriver events through the data pipeline.

The main difficulty here seems to be that somewhere in the infrastructure we need a log parser to get from the log message format into a format compatible with a destination table in BigQuery. That logic could exist either FxA-side in something like a cloud function as :jbuck suggested, or that logic could exist data pipeline-side.

If we do the log parsing data pipeline-side, then we can plug into the data pipeline's error routing and we avoid having to make the telemetry edge server's Pub/Sub message format a public interface. Assuming FxA's logs are emitted in mozlog format, it would be nice long-term to develop a fairly generic system for transforming mozlog Stackdriver messages into messages suitable for the pipeline.

As a practical concern, it also appears that availability for developing log parsing logic may be least constrained on the data platform side at this point.

So, :whd and I are proposing the following:

  • We grant publish permissions on the structured-aet topic to the relevant service account used by Stackdriver log sinks on the FxA side
  • :jbuck configures a Stackdriver sink that selects only AET-related events and publishes them to the structured-aet topic
  • :klukas adds logic to the Decoder job to detect Stackdriver-formatted logs, transforming them into a format suitable for the destination firefox-accounts/account-ecosystem document type

We would initially add specific support for just the fxa_activity - oauth_access_token_created event of this bug and the account.updateEcosystemAnonId.complete event discussed in bug 1656949.

As future steps, we would want to more fully look into the mozlog format and whether we can generically handle ingesting mozlog messages from Stackdriver to reduce the need for bespoke logic for each new event type.
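
The generic handling described above might look roughly like the following sketch: detect a mozlog-shaped Stackdriver entry and repackage it for the pipeline. The field names (Logger, Type, Fields) follow the mozlog convention, but the exact payload shapes and attribute names here are assumptions, not the Decoder's actual implementation:

```javascript
// Illustrative sketch only: detect a mozlog-style Stackdriver LogEntry and
// wrap its Fields as a pipeline-style message with routing attributes.
function looksLikeMozlog(entry) {
  const p = entry && entry.jsonPayload;
  return Boolean(p && typeof p.Logger === 'string' && p.Fields && typeof p.Fields === 'object');
}

function wrapForPipeline(entry, docType) {
  return {
    attributes: {
      document_namespace: 'firefox-accounts', // assumed namespace for routing
      document_type: docType,
      submission_timestamp: entry.timestamp, // log entry time stands in for edge time
    },
    data: JSON.stringify(entry.jsonPayload.Fields),
  };
}
```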

?ni Jared - Does the above sound more viable as a step toward a long-term solution compared to transforming data FxA-side? In particular, this route gives us a story for error monitoring.

Flags: needinfo?(jhirsch)

+1, great summary :klukas. That plan sounds great to me.

Flags: needinfo?(jhirsch)
Points: --- → 3
Priority: -- → P2
Assignee: nobody → jklukas

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #9)

PR with the changes from this comment: https://github.com/mozilla-services/cloudops-infra/pull/2403

I have been able to find some of these events in BigQuery and in the Stackdriver logging interface with jsonPayload.Fields.event = oauth.token.created, including some that look to have legitimate values under jsonPayload.Fields.ecosystemanonid. I think this is enough to get me started on building support in the decoder.

In particular, my plan here will be to look for events with event set to oauth.token.created, and then extract only the ecosystemanonid value, the timestamp field (mapped to submission_timestamp in the pipeline), userAgent (which the pipeline will decode), and insertId (which we will coerce into a document_id used for deduplication). So we'll be dropping most of the message. I plan to transform these so they get routed to the firefox_accounts_live.account_ecosystem_v1 table in BQ.
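
The field mapping described above can be sketched as a small extraction function. This is a sketch, not the Decoder's actual code: the input shape assumes a Stackdriver entry as it appears in BigQuery (jsonPayload.fields.*), and the output field names are illustrative:

```javascript
// Sketch of the planned mapping: keep only the AET-relevant fields from an
// oauth.token.created log entry and drop everything else. Input/output
// shapes here are assumptions for illustration.
function extractAccountEcosystemEvent(entry) {
  const fields = entry.jsonPayload.fields;
  if (fields.event !== 'oauth.token.created' || !fields.ecosystemanonid) {
    return null; // not an AET-relevant event
  }
  return {
    document_id: entry.insertId,            // coerced for deduplication
    submission_timestamp: entry.timestamp,  // mapped from the log timestamp
    user_agent: fields.userAgent,           // pipeline decodes this downstream
    ecosystem_anon_id: fields.ecosystemanonid,
  };
}
```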

Once we have that support merged and verified as working, we can discuss whether there are additional attributes we want to parse out of these messages. In particular, we probably want geo, but that's going to be a bit tough since these messages have already parsed out geo fields, and they are in a different format than what's expected in the pipeline (long-form country names rather than 2 character country codes, for example).

Some more PRs required for prod:

These have been landed, and the pipeline is successfully processing them.

Once we have that support merged and verified as working, we can discuss whether there are additional attributes we want to parse
out of these messages.

Just dropping by to say that we'll definitely want the OAuth client_id from these events (or some other way of identifying which application they're linked back to) in order to correlate with the AET data submitted by each application.

Just dropping by to say that we'll definitely want the OAuth client_id from these events (or some other way of identifying which application they're linked back to) in order to correlate with the AET data submitted by each application.

It looks like event_properties is always null for these "oauth.token.created" events from what I can see in BQ, which is where I'd expect the oauth client_id to be populated. I don't know whether that's expected from the FxA side for these events.

I've filed https://github.com/mozilla/fxa/issues/6290 to add the client_id (and, if it's not too difficult, the human-readable client name) to the oauth.token.created events.

I've just landed the change to include the oauth client_id in these events as the clientId field. This should be on stage later today or Monday, as part of the 187.2 train release. The deployment bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1664244, if you'd like to follow along.

I'm not sure whether we want to do the client_id -> client_name mappings inside AET BQ. If we do, FxA maintains a spreadsheet of client_id to readable client name mappings, and our ops folks have access to it.

It looks like event_properties is always null for these "oauth.token.created" events from what I can see in BQ, which is where I'd expect the oauth client_id to be populated. I don't know whether that's expected from the FxA side for these events.

Hmm. From glancing at the FxA metrics code, it looks like event_properties is only set on events that are intended for amplitude, which these aren't. I think you should be able to find this as jsonPayload.fields.clientId. If you need fields in the event_properties on the event, feel free to file a bug at https://github.com/mozilla/fxa/issues/new and I can make the changes.

I think you should be able to find this as jsonPayload.fields.clientId

The following query shows that jsonPayload.fields.client_id is never populated for these events:

SELECT jsonPayload.fields.client_id, COUNT(*)
FROM `moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_20200911`
WHERE jsonPayload.fields.event = 'oauth.token.created'
GROUP BY 1

If you need fields in the event_properties on the event

I can adjust the pipeline to pull these from wherever in the payload, but as far as I can tell the clientId is not present at all, unless this field is somehow dropped in the Stackdriver -> BQ fxa_prod_logs step.

Ah, to be clear, the change to add the clientId (note camelCase, not snake_case) to the event just landed but hasn't been released yet. It should hit staging today and production by the middle of next week, pending QA signoff. Bug 1664244 will have updates as the release continues.

I now see records flowing in with non-null values of jsonPayload.fields.clientid. I see 9 distinct values in:

SELECT jsonPayload.fields.clientid, count(*) 
FROM `moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_20200914` 
WHERE jsonPayload.fields.event = 'oauth.token.created'
GROUP BY 1

These events are now flowing and available in BigQuery, including geo and oauth_client_id. Example query:

SELECT
  oauth_client_id,
  COUNT(*)
FROM
  `moz-fx-data-shared-prod.firefox_accounts.account_ecosystem`
WHERE
  DATE(submission_timestamp) = '2020-09-22'
GROUP BY
  1
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
