Closed Bug 1649871 Opened 4 years ago Closed 4 years ago

Manage event volume of new "fxa_activity - oauth_access_token_created" event

Categories

(Data Platform and Tools :: General, enhancement, P1)

enhancement
Points:
1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhirsch, Assigned: jhirsch)

Details

Attachments

(3 files, 1 obsolete file)

This is a followup from bug 1632635.

Sync is in the process of moving from using browserid assertions to oauth access tokens, and this change is already completed for mobile. We've lost visibility into the number of devices used by these users, as the daily device info was recorded in the fxa_activity - cert_signed events fired when a browserid assertion was signed by the auth-server.[1]

I'm just about to land[2] a new amplitude event in FxA to restore the missing device usage data, by instead recording an event when oauth access tokens are created, and appending the device data to that event.

The new event name is fxa_activity - oauth_access_token_created.

This should hit production next week. In parallel, ops will lower the FxA oauth access token TTL from 2 days to 6 hours, which is the same TTL as the old browserid assertions, so we'll have the same need for per-user rollups that we had with the old cert_signed events.

It looks like a bigquery-etl PR will be needed, and one in cloudops-infra as well. I'll open those now and attach to this bug.

[1] https://github.com/mozilla/fxa/issues/5143
[2] https://github.com/mozilla/fxa/pull/5820

Hope it's okay to assign this bug to you, Frank :-)

I'm planning to do the work, I'll just need reviews in case I've missed something.

We'll want to leave the fxa_activity - cert_signed rollups in place until desktop switches from browserid to oauth at some point in the future.

Assignee: nobody → fbertsch
Attachment #9160754 - Attachment is obsolete: true
Attachment #9160755 - Flags: review?(jklukas)
Attachment #9160755 - Flags: review?(fbertsch)
Comment on attachment 9160762 [details] [review] GitHub Pull Request Cloudops-infra PR to avoid this new high-volume event from blowing our Amplitude budget
Attachment #9160762 - Flags: review?(jrgm)
Attachment #9160762 - Flags: review?(jbuckley)
Comment on attachment 9160755 [details] Link to BigQuery-ETL pull request adding support for the new event Frank is on PTO today, but the change looks good to me. I went ahead and merged, and Frank can comment when he gets back if he has any concerns.
Attachment #9160755 - Flags: review?(jklukas) → review+
Points: --- → 1
Priority: -- → P1
Attachment #9160762 - Flags: review?(jbuckley) → review+

The patch in #c5 has been applied in production

Comment on attachment 9160755 [details] Link to BigQuery-ETL pull request adding support for the new event This solution looks good to me. Looks like there's nothing for me to do on this bug now!
Attachment #9160755 - Flags: review?(fbertsch) → review+

Jared, assigning this bug back to you since it looks like you already completed it. Let me know if I'm missing something.

Assignee: fbertsch → jhirsch
Flags: needinfo?(jhirsch)

Thanks, Frank! Looks like the two pull requests have landed, so I think this is good to close.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(jhirsch)
Resolution: --- → FIXED

Hey Frank, is it possible to verify that a non-zero number of these new events are arriving and are getting added to the rollup counts?

Status: RESOLVED → REOPENED
Flags: needinfo?(fbertsch)
Resolution: FIXED → ---

:klukas, shifting the ni? your way

Flags: needinfo?(jklukas)

From

SELECT
  jsonPayload.fields.event_type,
  COUNT(*)
FROM
  `moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_20200817`
GROUP BY
  1

I see 9219819 "fxa_activity - oauth_access_token_created" events in yesterday's source table. So these events are definitely arriving, and I based on the current logic in the query they should definitely be included in the rollups.

Flags: needinfo?(jklukas)

Sorry for the delay here - the best we can do is what :klukas queried. We don't have any count of what the events are in Amplitude. Any particular reason you think they might not be getting sent over?

Flags: needinfo?(fbertsch) → needinfo?(jhirsch)

Alex is seeing some unexpectedly low usage numbers in Amplitude. I think he's on PTO this week, but I'll redirect the needinfo his way so he can share a link to an amplitude chart

Flags: needinfo?(jhirsch) → needinfo?(adavis)

When Firefox for iOS moved over to OAuth in May, our multi-device roll-up numbers tanked in Amplitude. Other metrics changes did happen then (like the activity roll-up this bug focused on) so perhaps one of the others also impacted this.

At the beginning of August, as Android users have started to move from Fennec to Fenix (OAuth), we've seen our multi-device roll-up number slowly drop too.

With desktop moving to OAuth last week, we're seeing the roll-up number drop even further.

https://analytics.amplitude.com/mozilla-corp/chart/n0byr37

Leif might have other ideas.

Flags: needinfo?(adavis) → needinfo?(loines)

I think the drop in Alex's chart relates to this issue.

That issue is now closed, but seems (?) also to maybe have been blocked on this issue, which is still open.

Jared, in FXA-1780 (former link above) you write:

  1. Update the FxA server metrics code to derive the missing sync usage events from the fxa_activity - access_token_created event, generated when an oauth token is created, instead of the no longer reliable fxa_activity - cert_signed event, which was generated when a browserid assertion was signed.

Did we ever do this? If not, I think it would explain why our sync_active_devices_* metrics are tanking,

Flags: needinfo?(loines) → needinfo?(jhirsch)

FWIW we could update the Amplitude export to add a new user property that tells us which events the meta-event is derived from. It wouldn't take long to figure out then which activity event was affecting the change.

(In reply to Leif Oines [:loines] from comment #17)

Jared, in FXA-1780 (former link above) you write:

  1. Update the FxA server metrics code to derive the missing sync usage events from the fxa_activity - access_token_created event, generated when an oauth token is created, instead of the no longer reliable fxa_activity - cert_signed event, which was generated when a browserid assertion was signed.

Did we ever do this? If not, I think it would explain why our sync_active_devices_* metrics are tanking,

Yes, at least I think so:

  • the fxa_activity - oauth_access_token_created event was added to FxA in https://github.com/mozilla/fxa/pull/5820, meaning these events are in the FxA server logs that are routed into BigQuery
  • the sync device counts are generated by bigquery-etl code, and this very bug contains the patch that added the new event to those rollups (comment #3) as well as a patch that filters this high-volume event out of amplitude, as we've done for the other oauth events (comment #5)
  • in comment #13, klukas confirmed that these new events are in the FxA logs in BigQuery and should be present in the rollups.
Flags: needinfo?(jhirsch)

(In reply to Frank Bertsch [:frank] from comment #18)

FWIW we could update the Amplitude export to add a new user property that tells us which events the meta-event is derived from. It wouldn't take long to figure out then which activity event was affecting the change.

Sure, that sounds like a good idea to me. Is this an easy change to make?

Flags: needinfo?(fbertsch)

Sure, that sounds like a good idea to me. Is this an easy change to make?

It should be relatively straightforward. Let me take a look.

Flags: needinfo?(fbertsch)

Thanks for the info jared.

So the desktop browser stopped using browser_id assertions in version 80, which definitely means we will not be getting the fxa_activity - cert_signed events going forward. However, this should not have caused the drop we are seeing since virtually all these clients should also be generating the fxa_activity - oauth_access_token_created event as well, and in theory with the same device_id (?).

Taking a look at this query the majority of oauth_access_token_created events have a sync_active_devices_day count of 0. Which is wrong, since the device that generated the event should be counted as active. I think this may be the source of the problem.

Jared, you said that these device counts are computed by the bigquery_etl code, but I think its just taking the max of the value that is provided by the event, which is computed by the FxA server.

Flags: needinfo?(jhirsch)

Jared, I merged the above PR, so on the next data load we will have a new event property called rollup_events that is an array of the distinct events that make up that active event.

Jared, you said that these device counts are computed by the bigquery_etl code, but I think its just taking the max of the value that is provided by the event, which is computed by the FxA server.

I can confirm, this is coming from the FxA server.

Hmm, I wonder if we're not getting a list of devices because of the weird magic of the metricsContext, an issue that came up just this week: https://github.com/mozilla/fxa/issues/6347.

Vijay, you were just poking around at this code. Any idea if this new event needs to pull device counts out of the metricsContext to work?

Flags: needinfo?(jhirsch) → needinfo?(vbudhram)

Any idea if this new event needs to pull device counts out of the metricsContext to work?

I believe we should be doing this already. Quick test of https://github.com/mozilla/fxa/issues/6347 and I see the following amplitude events logged locally

first sync login

16|auth  | 2020-09-11T11:50:46: fxa-auth-server.INFO: amplitudeEvent {"op":"amplitudeEvent","event_type":"fxa_activity - oauth_access_token_created","time":1599839446263,"user_id":"e7fc3c71d8ff45b684160819cd114c64","app_version":"187.0","language":"en-US","os_name":"Mac OS X","os_version":"10.15","event_properties":{},"user_properties":{"ua_browser":"Firefox","ua_version":"80.0","sync_device_count":1,"sync_active_devices_day":1,"sync_active_devices_week":1,"sync_active_devices_month":1}}
16|auth  | 2020-09-11T11:50:46: fxa-auth-server.INFO: amplitudeEvent {"op":"amplitudeEvent","event_type":"fxa_activity - cert_signed","time":1599839446269,"user_id":"e7fc3c71d8ff45b684160819cd114c64","app_version":"187.0","language":"en-US","os_name":"Mac OS X","os_version":"10.15","event_properties":{},"user_properties":{"ua_browser":"Firefox","ua_version":"80.0","sync_device_count":1,"sync_active_devices_day":1,"sync_active_devices_week":1,"sync_active_devices_month":1}}

which shows sync_device_count=1 in the user properties.

second sync login on new profile

16|auth  | 2020-09-11T11:52:32: fxa-auth-server.INFO: amplitudeEvent {"op":"amplitudeEvent","event_type":"fxa_activity - oauth_access_token_created","time":1599839552406,"user_id":"e7fc3c71d8ff45b684160819cd114c64","app_version":"187.0","language":"en-US","os_name":"Mac OS X","os_version":"10.15","event_properties":{},"user_properties":{"ua_browser":"Firefox","ua_version":"80.0","sync_device_count":2,"sync_active_devices_day":2,"sync_active_devices_week":2,"sync_active_devices_month":2}}
16|auth  | 2020-09-11T11:52:32: fxa-auth-server.INFO: amplitudeEvent {"op":"amplitudeEvent","event_type":"fxa_activity - cert_signed","time":1599839552411,"user_id":"e7fc3c71d8ff45b684160819cd114c64","app_version":"187.0","language":"en-US","os_name":"Mac OS X","os_version":"10.15","event_properties":{},"user_properties":{"ua_browser":"Firefox","ua_version":"80.0","sync_device_count":2,"sync_active_devices_day":2,"sync_active_devices_week":2,"sync_active_devices_month":2}}

which shows sync_device_count=2 in the user properties.

From the looks of the log, the counts appear to be correct?

Flags: needinfo?(vbudhram)

What's the next step here?

Flags: needinfo?(jhirsch)

I'm investigating this currently, and it seems like an issue on the FxA server side.

I've got sync working with local FxA, and I'm observing a number of different amplitude events logged with inaccurate sync device counts--in particular, seeing a lot of 0 values.

Closing.

Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Flags: needinfo?(jhirsch)
Resolution: --- → FIXED

In testing the sync sign in flow locally, it looks like the oauth_access_token_created event is fired twice in close succession, with different values for the sync device counts, and the properties of the event vary between the two.

The first time the event is fired, the current login isn't counted towards the device totals. So, for a first login of the day, sync_device_count and sync_active_devices_day will both be 0.

A few seconds later, the event is fired a second time, with the sync device counts correctly incremented by one.

The first event also has some properties missing from the second: device_id, session_id, event_properties = {service: 'sync'}, and the user_properties field has flow_id and $append: { fxa_services_used: 'sync' }. The second event has event_properties = {}, and the other listed properties are simply not present.

Could these differences explain the sync device counts being unusually low? For instance, I noticed the rollup query includes a JSON_EXTRACT_SCALAR... IGNORE NULLS statement for the event_properties.service field. How exactly does IGNORE NULLS work in a case like this? Could the event with no {service: 'sync'} be getting dropped from the rollup?

Here are two pretty-printed example events from local testing, again, these are both fired in the course of one login to desktop sync:

first event:

{
  "op": "amplitudeEvent",
  "event_type": "fxa_activity - oauth_access_token_created",
  "time": 1601421751982,
  "user_id": "1f63e61e519648a88fb8661c60b1fb28",
  "device_id": "3e436ac79c894f29932839e474149281",
  "session_id": 1601421741925,
  "app_version": "189.0",
  "language": "en-US",
  "os_name": "Mac OS X",
  "os_version": "10.15",
  "event_properties": {
    "service": "sync"
  },
  "user_properties": {
    "flow_id": "2de9c8bceed3723c3bd921668e5b74f525910a322a5c2afe3d35d089dac5c339",
    "ua_browser": "Firefox",
    "ua_version": "83.0",
    "$append": {
      "fxa_services_used": "sync"
    },
    "sync_device_count": 0,
    "sync_active_devices_day": 0,
    "sync_active_devices_week": 0,
    "sync_active_devices_month": 0
  }
}

second event:

{
  "op": "amplitudeEvent",
  "event_type": "fxa_activity - oauth_access_token_created",
  "time": 1601421752248,
  "user_id": "1f63e61e519648a88fb8661c60b1fb28",
  "app_version": "189.0",
  "language": "en-US",
  "os_name": "Mac OS X",
  "os_version": "10.15",
  "event_properties": {},
  "user_properties": {
    "ua_browser": "Firefox",
    "ua_version": "83.0",
    "sync_device_count": 1,
    "sync_active_devices_day": 1,
    "sync_active_devices_week": 1,
    "sync_active_devices_month": 1
  }
}
Flags: needinfo?(fbertsch)

Hey Jared,

To answer your question - no, the IGNORE NULL only ignores nulls for that column, it does not drop that row from other aggregates.

To verify this, I created two queries, and both show what the data in Amplitude shows.

This viz shows the number of clients reporting more than 1 of each aggregate in the derived table we export to Amplitude.

This viz is the same query, but on the raw log data.

I believe this shows that the data issue is happening before it even hits the logs.

Flags: needinfo?(fbertsch) → needinfo?(jhirsch)

Thanks for checking, Frank! The sync device count story turns out to be a little complicated[1], but a fix for a related metrics bug landed last week[2] and hopefully that will resolve the sync device undercounting.

[1] https://github.com/mozilla/fxa/issues/6578
[2] https://github.com/mozilla/fxa/pull/6581

Flags: needinfo?(jhirsch)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: