Closed Bug 1632635 Opened 4 years ago Closed 4 years ago

Reduce "fxa_activity - cert_signed" event volume to one per-user per-day

Categories

(Data Platform and Tools :: General, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: frank)

Details

Attachments

(3 files)

As an org we're sending more events to Amplitude than we expected when we signed the contract last year. This bug is specifically for FxA, and understanding the "cert_signed" event, which accounts for ~23% of the total number of events we send to Amplitude.

Leif, can you describe what this event is, and if there could be a way to reduce the volume that it's being sent?

Flags: needinfo?(loines)

The event is fired periodically for authenticated and verified sync users on desktop, fennec and ios. It is fired whenever a client hits this endpoint on the FxA auth server. See also this.

My understanding (not an engineer) is that Sync periodically needs to refresh this certificate in order to receive a token from the sync token server that will allow it to continue syncing.

This is one of only two server-side events that are reliably generated and that will let us infer whether a client is still syncing. The other events are the creation and checking of oauth tokens, which are of even higher volume and which we already filter out before they are sent to amplitude. As such it is the primary event that contributes to measures of FxA/Sync DAU and MAU (a user that syncs at least once on a given day should almost always generate one of these events). I know Alex is critically reliant on it for daily monitoring of our metrics.

The FxA team is aware that this is a costly event due to its volume. We are actively exploring ways to sample it, see bug 1592123.

CC Jared and Alex

Flags: needinfo?(loines)

Hey Jared and Jon, we tentatively came up with the idea of sending a single cert_signed event per-user at EOD instead of real-time. I have no knowledge of the current pipeline to Amplitude, but Leif has indicated that it's tailed server logs -> Pubsub -> Amplitude HTTP API.

Do either of you have any idea if this is feasible? It would essentially move processing of these events to a batch job that aggregates per-client. My guess is this is too much data to possibly process on one node, so we'd need something like BQ as an intermediate to do that processing. Once we've aggregated, we could use the same pipeline you already have and push the data to Pubsub and then to Amplitude, hopefully resulting in the same schema with minimal work.
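The batch rollup described above can be sketched as follows. The event shape, field names, and rollup event name here are illustrative assumptions, not the actual FxA log schema:

```python
from datetime import datetime

# Hypothetical raw event records; field names are illustrative only.
events = [
    {"uid": "u1", "event": "cert_signed", "ts": "2020-04-23T01:05:00+00:00"},
    {"uid": "u1", "event": "cert_signed", "ts": "2020-04-23T13:10:00+00:00"},
    {"uid": "u2", "event": "cert_signed", "ts": "2020-04-23T02:00:00+00:00"},
]

def rollup(events):
    """Collapse raw events into one activity event per user per day."""
    seen = set()
    out = []
    for e in events:
        day = datetime.fromisoformat(e["ts"]).date()
        key = (e["uid"], day)
        if key not in seen:
            seen.add(key)
            out.append({"uid": e["uid"], "day": day.isoformat(),
                        "event": "fxa_activity - active"})
    return out

daily = rollup(events)
```

In the real pipeline this grouping would happen as a BQ query rather than in-process, but the semantics are the same: one output row per (user, day) regardless of how many raw events arrived.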

We also need to be clear about which metrics and plots this may affect for any real-time, or daily, visualizations you all use. Any single-event counts should be unaffected, but general user-counting wouldn't be available until EOD. Similarly, retention would be low until these events are sent.

Flags: needinfo?(jhirsch)
Flags: needinfo?(jbuckley)

The idea outlined in Comment 2 would reduce the number of events monthly from ~1.4B to ~250M, a reduction of about 1150M, or about a 19% reduction in total events we send to Amplitude.

The other events are the creation and checking of oauth tokens, which are of even higher volume and which we already filter out
before they are sent to amplitude.

I wanted to chime in here to mention that the fact that we filter these out in Desktop is shaping up to be a blocker for some client-side work the sync team is doing, because we want to move away from using BrowserID assertions (which generate cert_signed events) and towards OAuth tokens (which generate oauth-related events). See e.g. Bug 1591312 where we had to back out some work that was heading in that direction, because it would have meant some users disappearing from our MAU.

From my perspective, an ideal solution would send a single activity event per user at EOD, encompassing both cert_signed and token_created events. I don't think we particularly need the ability to distinguish which event it was, only that some FxA-related activity was registered for that user on that day.

This sounds like it would be almost trivial to implement. All FxA event logs are already sent to BigQuery via a Stackdriver integration, and Airflow has permission to query those tables. Our FxA KPI reporting is based on a series of queries that hit these Stackdriver-created tables, and it sounds desirable that we'd send exactly the same set of users to Amplitude as what we consider active for KPIs.

The difficult thing here is that FxA IDs are hashed using an HMAC key before being sent to Amplitude. The logs in BigQuery have the raw IDs. This is a problem we keep running into in discussing various FxA metrics tasks.

It may be worth at this point considering whether we could make the HMAC key available for Airflow to access. If we did that, we could build a Docker container that would be able to do the same HMAC hashing that the FxA pipeline does. It would pull the list of active IDs for the day via BQ query to the Stackdriver log tables, HMAC them, and write the results to S3 for Amplitude to ingest. We then might be able to reuse that pattern for some other tasks in the FxA metrics migration.
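A minimal sketch of that hashing step, assuming HMAC-SHA256; the actual algorithm and key handling used by the FxA pipeline would need to be confirmed:

```python
import hashlib
import hmac

def hash_uid(raw_uid: str, key: bytes) -> str:
    """HMAC the raw FxA user id; one-way given a secret key."""
    return hmac.new(key, raw_uid.encode("utf-8"), hashlib.sha256).hexdigest()

# In practice the key would come from an ops-managed secret store,
# never a literal in code; this is purely illustrative.
key = b"example-secret-key"
hashed = hash_uid("0123456789abcdef", key)
```

The important property is determinism: hashing the same raw ID with the same key always yields the same Amplitude-side ID, so a batch job using the same key produces IDs that match the existing real-time pipeline.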


If having the HMAC available to Airflow is a blocker at all, we can push to pub/sub from BQ and use their existing pipeline to load the data. Who do we need to get permission from to enable that key use on our end?

Flags: needinfo?(jklukas)

If having the HMAC available to Airflow is a blocker at all, we can push to pub/sub from BQ and use their existing pipeline to load the data.

We have been essentially relying on that pattern for some other pieces of the FxA migration work. This may indeed be an option.

Who do we need to get permission from to enable that key use on our end?

jbuck may have some good context on that, and perhaps :rfkelly. I really don't know how to reason about the risk surrounding giving enhanced access to that key.

Flags: needinfo?(jklukas)

(In reply to Ryan Kelly [:rfkelly] from comment #4)

From my perspective, an ideal solution would send a single activity event per user at EOD, encompassing both cert_signed and token_created events. I don't think we particularly need the ability to distinguish which event it was, only that some FxA-related activity was registered for that user on that day.

+1 we should definitely do this

Are we already getting the token_created events from Stackdriver? If so that would be almost no additional work on top of what we're already looking at here.

If having the HMAC available to Airflow is a blocker at all, we can push to pub/sub from BQ and use their existing pipeline to load the data.

We have been essentially relying on that pattern for some other pieces of the FxA migration work. This may indeed be an option.

Who do we need to get permission from to enable that key use on our end?

jbuck may have some good context on that, and perhaps :rfkelly. I really don't know how to reason about the risk surrounding giving enhanced access to that key.

Ryan, do you have any context on the issues about making the HMAC key available to our Airflow instance?

Flags: needinfo?(rfkelly)

Updating the bug title to more accurately represent current conversation.

Summary: Understand why "fxa_activity - cert_signed" is ~70% of FxA events in Amplitude → Reduce "fxa_activity - cert_signed" event volume to one per-user per-day

(In reply to Frank Bertsch [:frank] from comment #9)

From my perspective, an ideal solution would send a single activity event per user at EOD, encompassing both cert_signed and token_created events. I don't think we particularly need the ability to distinguish which event it was, only that some FxA-related activity was registered for that user on that day.

+1 we should definitely do this

Are we already getting the token_created events from Stackdriver? If so that would be almost no additional work on top of what we're already looking at here.

Yes they are already there. We just ignore them when sending to amplitude currently, so we would just need to remove that filter.

Hey all, we're hoping to move quickly on this, so responses are appreciated. The current plan is the following:

  1. Create a new table, derived from the FxA data in BQ, that groups by user-days and filters to cert_signed and token_created [0]. For every active day (derived from the timestamp field [1]), we will derive a single event for every user, with name fxa_activity - active. We will omit the event properties oauth_client_id and service from the events.
  2. Create a job to send this data to an FxA vacuum, where it will be loaded into the FxA project. This requires working with Amplitude to get that set up.
  3. Once we've confirmed the user count numbers for the new fxa_activity - active event, we can have the FxA pipeline stop sending the cert_signed event. There will be some overlap in time where both are sent, but that is acceptable from an analysis perspective.

Note: This plan can probably be acted on quickly but requires us to hash the user ids in the same way as the current FxA pipeline does. We are still waiting on confirmation from the FxA team on whether that is possible.

[0] This filtering isn't strictly required. We could use all events and send a single "activity" per-user per-day, encompassing any activity.
[1] Timestamp is a bit nebulous. Looking over the tables, I see a timestamp field, in addition to a receiveTimestamp field. There is a slight delay from timestamp -> receiveTimestamp. Because the table is partitioned on timestamp, I want to ensure we won't miss any activity when timestamp and receiveTimestamp occur on different dates, where a day boundary occurs between the two.
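The day-boundary concern can be expressed as a simple check; the ISO timestamp strings here are illustrative:

```python
from datetime import datetime

def crosses_day_boundary(ts_str: str, receive_ts_str: str) -> bool:
    """True when the event timestamp and receive timestamp land on
    different UTC dates, i.e. the row could fall in a different
    partition than the day the activity actually occurred."""
    ts = datetime.fromisoformat(ts_str)
    rts = datetime.fromisoformat(receive_ts_str)
    return ts.date() != rts.date()

# An event emitted just before midnight but received just after it:
late = crosses_day_boundary("2020-04-23T23:59:58+00:00",
                            "2020-04-24T00:00:03+00:00")
```

Rows like this are the ones a naive partition filter on a single date could miss, depending on which of the two fields the partitioning uses.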

Couple questions -

  1. There is also an fxa_activity - access_token_checked amplitude event. Can we add that to the list of events that get sent to the vacuum? This may not be strictly necessary, as most clients that generate this event also generate fxa_activity - access_token_created, but I believe there are some cases where a client might only generate the checked event on a given day, which would cause them not to be counted towards DAU if we omit it. I will double check to see how often this happens (client sends one event but not the other).

  2. Is there a plan for adding back in the service and oauth client_id event properties? E.g. could we take the set of all the unique values associated with the activity events for a given user-day, and send them as arrays under the service / oauth_client_id event properties for the rollup event? They are kind of important for segmenting DAU by service. The docs seem to indicate that this at least possible for the HTTP API: https://help.amplitude.com/hc/en-us/articles/204771828-HTTP-API (see the example for event_properties) So for example if a user generated a cert signed event for sync and an access token event for monitor we would do something like

{"service": ["sync", "fx-monitor"],"oauth_client_id":["802d56ef2a9af9fa"]}

(note the sync service does not have an oauth_client_id)
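For illustration, a rollup event with array-valued properties might be shaped like this before being posted to the HTTP API. The user_id / event_type / event_properties structure follows Amplitude's documented payload format; the specific values and the user id are hypothetical:

```python
import json

# Illustrative rollup event; the ingestion endpoint and auth are out of scope.
event = {
    "user_id": "hashed-fxa-uid",
    "event_type": "fxa_activity - active",
    "event_properties": {
        # Unique set of values seen across the user's events that day.
        "service": ["sync", "fx-monitor"],
        "oauth_client_id": ["802d56ef2a9af9fa"],
    },
}
payload = json.dumps(event)
```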

  1. There is also an fxa_activity - access_token_checked amplitude event. Can we add that to the list of events that get sent to the vacuum? This may not be strictly necessary, as most clients that generate this event also generate fxa_activity - access_token_created, but I believe there are some cases where a client might only generate the checked event on a given day, which would cause them not to be counted towards DAU if we omit it. I will double check to see how often this happens (client sends one event but not the other).

Definitely. I mentioned in [0] that we could actually remove any filtering, so that this active event would encompass any activity. I'm not sure how that would play with the event_properties discussed below, though (if e.g. those events are sending different services).

  2. Is there a plan for adding back in the service and oauth client_id event properties? E.g. could we take the set of all the unique values associated with the activity events for a given user-day, and send them as arrays under the service / oauth_client_id event properties for the rollup event? They are kind of important for segmenting DAU by service. The docs seem to indicate that this at least possible for the HTTP API: https://help.amplitude.com/hc/en-us/articles/204771828-HTTP-API (see the example for event_properties) So for example if a user generated a cert signed event for sync and an access token event for monitor we would do something like

{"service": ["sync", "fx-monitor"],"oauth_client_id":["802d56ef2a9af9fa"]}

We can definitely add these back in. It sounds like we would aggregate all services and all oath_client_id, taking the unique set for each. Does this sound like the right approach, Leif?

Flags: needinfo?(loines)

I don't think we need to fret about the event_type -> service mappings: all services except sync can occur with both types of access_token events, and the vast majority of cert_signed events are just sync. By aggregating the services and event properties we lose information about which event type was originally associated with which service, but that is not important for analysis.

Your intuition about how to aggregate the service and oauth_client_id fields is correct.

Flags: needinfo?(loines)

I was too fast to submit my last comment:

I think it would be useful to also aggregate the following user properties in a similar way:

sync_active_devices_* (day, week, month), sync_device_count.

If possible we should also aggregate fxa_services_used and then update it using $postInsert as documented here. Although :jbuck & :_6a68 - this postInsert function is only available using the identify API - would that be a problem for us?

Ideally we would also do something similar for OS, Language, and Country, but reading between the lines here it doesn't appear we'll be able to do that, which is a shame. It means that we will no longer be able to use the activity events to know which OS users were active on in a given day, e.g. whether they were active on both mobile and desktop. We can still use other events to answer these types of questions, but it's not ideal. Maybe as a fast follow we could introduce a new event property like os_used_on_day or activity_event_os_array and set-aggregate like above.

(In reply to Frank Bertsch [:frank] from comment #13)

  3. Once we've confirmed the user count numbers for the new fxa_activity - active event, we can have the FxA pipeline stop sending the cert_signed event. There will be some overlap in time where both are sent, but that is acceptable from an analysis perspective.

I can stop the flow of the original events whenever - just need to change the filter being used on the FxA side

Note: This plan can probably be acted on quickly but requires us to hash the user ids in the same way as the current FxA pipeline does. We are still waiting on confirmation from the FxA team on whether that is possible.

I can provide the HMAC key to you, can you talk about access control once it's been loaded into the Airflow cluster? I know HMAC's are one-way, but if the key is only visible to ops folks that would be ideal.

[1] Timestamp is a bit nebulous. Looking over the tables, I see a timestamp field, in addition to a receiveTimestamp field. There is a slight delay from timestamp -> recieveTimestamp. Because the table is partitioned on timestamp, I want to ensure we won't miss any activity when timestamp and recieveTimestamp occur on different dates, where a day boundary occurs between the two.

In fxa-amplitude-send we use the jsonPayload.Fields.time field when sending data to Amplitude, which I think corresponds to the timestamp field.

Flags: needinfo?(jbuckley)

I can stop the flow of the original events whenever - just need to change the filter being used on the FxA side

Perfect, we'll plan on that once this work is ready.

I can provide the HMAC key to you, can you talk about access control once it's been loaded into the Airflow cluster? I know HMAC's are one-way, but if the key is only visible to ops folks that would be ideal.

Yes, it should be. You can get in contact with Harold (cc'ed him here) to get the key added to Airflow. Once there it is not even visible to admins if stored as a secret, and we can still pass it in as a param to the query.

In fxa-amplitude-send we use the jsonPayload.Fields.time field when sending data to Amplitude, which I think corresponds to the timestamp field.

Great, we'll continue to do this.

Flags: needinfo?(jbuckley)

(In reply to Leif Oines [:loines] from comment #17)

I was too fast to submit my last comment:

I think it would be useful to also aggregate the following user properties in a similar way:

sync_active_devices_* (day, week, month), sync_device_count.

Leif, are these user properties filled in currently from the cert_signed event? If so, we will indeed need to send those along with the events.

If possible we should also aggregate fxa_services_used and then update it using $postInsert as documented here. Although :jbuck & :_6a68 - this postInsert function is only available using the identify API - would that be a problem for us?

We will be using what they call a "vacuum", which is essentially an uploaded CSV that they import. I'm not sure offhand what they do/do not support w.r.t. user properties, but we can request they make $postInsert available there. When we reach out about creating this vacuum we can ask about those options.

Ideally we would also do something similar for OS, Language, Country but reading between the lines here it doesn't appear we'll be able to do that, which is a shame. It means that we will no longer be able to use the activity events to know which e.g. OS users were active on in a given day, e.g. if they were active on both mobile and desktop. We can still use other events to answer these types of questions, but its not ideal. Maybe as a fast follow we could introduce a new event property like os_used_on_day or activity_event_os_array and set-aggregate like above.

Are these questions that are already answered with the cert_signed ping? If so we don't want to lose them. I believe we can do exactly what you mentioned earlier - take the unique set of e.g. OSes.

Flags: needinfo?(loines)

(In reply to Frank Bertsch [:frank] from comment #20)

(In reply to Leif Oines [:loines] from comment #17)

Leif, are these user properties filled in currently from the cert_signed event? If so, we will indeed need to send those along with the events.

Are these questions that are already answered with the cert_signed ping? If so we don't want to lose them. I believe we can do exactly what you mentioned earlier - take the unique set of e.g. OSes.

Yes, they are sent with the cert_signed event. You can use this BigQuery query as a reference for what is sent in the event_properties and user_properties fields. I believe the value for os is derived from the jsonPayload.fields.os_name column. Country and Language are also there.

Edit: Note that we are using $append for fxa_services_used here but we should really be using $postInsert per amplitude's advice (we just haven't made the change yet)

Flags: needinfo?(loines)

Ryan, do you have any context on the issues about making the HMAC key available to our Airflow instance?

:jbuck will have better context on this than I do. My main question is, who has the ability to calculate HMACs using this key? (Which is a slightly different question to "who has the ability to access this key?"). The threats to be concerned about here are:

  • Given a raw FxA userid, who is able to calculate the corresponding hashed userid in amplitude?
  • Given a hashed userid from amplitude, who is able to try to brute-force-guess the corresponding raw FxA userid?

Ideally the answer to both of these questions is "only a restricted set of operational staff at Mozilla". I've no objection to making that set bigger, but I wouldn't want to allow e.g. anyone at Mozilla to calculate HMACs using this key.

I believe there are some cases where a client might only generate the checked event on a given day, which would
cause them not to be counted towards DAU if we omit it. I will double check to see how often this happens

This definitely happens, because some of our OAuth tokens live for longer than 1 day. Including checked sounds valuable to me.

My opinions on aggregating services used etc are accurately represented by Leif's comments above, so I won't repeat any of it here apart from "+1".

Flags: needinfo?(rfkelly)

Ryan, the usual path is we create a table that has the exact events we want to send to Amplitude. Currently that means anyone with access to Telemetry data will have access to both the unhashed and hashed userids, though with no link between them. We could lock down the hashed userids table, if that would alleviate any issues on your end.

The HMAC should only be available to Airflow jobs and ops.

We could lock down the hashed userids table, if that would alleviate any issues on your end.

If this is feasible to lock down that table, please do so. Thanks!

Currently that means anyone with access to Telemetry data will have access to both the unhashed and hashed userids

Unhashed FxA user IDs do not exist anywhere in telemetry data. The existing imports of FxA data that we do via Airflow read from fxa-prod project (which has the unhashed IDs) but hash the IDs as part of the query so that the resulting tables that live in the shared-prod project do not contain raw FxA IDs.

HMAC-hashed FxA UIDs already exist in shared-prod as they are passed in the sync ping (it's unclear whether these are hashed with the same key as the events sent to Amplitude).

So, I don't see any issue with telemetry users having access to the HMAC-hashed UIDs.

Unhashed FxA user IDs do not exist anywhere in telemetry data. The existing imports of FxA data that we do via Airflow read from fxa-prod project (which has the unhashed IDs) but hash the IDs as part of the query so that the resulting tables that live in the shared-prod project do not contain raw FxA IDs.

Thanks for clarifying that, Jeff. This also means we can't use those tables for the Amplitude import. Given this situation I agree that limiting access to the hashed data isn't a big concern.

Flags: needinfo?(jhirsch)

Frank and I met today and I agreed to provide a spec for (1) how we should aggregate the user and event properties for the rollup event and (2) which operations we should use when sending the event to Amplitude. Here goes:

| name | event or user property | aggregation | special amplitude operation (if needed) |
| --- | --- | --- | --- |
| service | event | array | none |
| oauth_client_id | event | array | none |
| fxa_services_used | user | array | $postInsert (we are changing this from $append) |
| sync_device_count | user | max | none |
| sync_active_devices_day | user | max | none |
| sync_active_devices_week | user | max | none |
| sync_active_devices_month | user | max | none |
| OS (os_name in the logs)* | user | mode | none |
| OS Version (os_version in the logs)* | user | mode | none |
| Language* | user | mode | none |
| ua_version | user | mode | none |
| ua_browser | user | mode | none |
| Version (app_version in the logs; this is the version of the FxA server) | user | max | none |
| Country and Region* | user | mode | none |

edited to reflect comments below.

*I'm unsure if we can actually send these properties to Amplitude as arrays. Let me know if that ends up being a problem; I guess we can use mode if we don't have much of a choice.

I think that's all of the properties that are relevant to the fxa_activity - * events. As I said, we should work under the assumption that all event types can take all of these properties, even if that's not true at the moment (many of them will sometimes be null). I also believe that if you don't specify an operation then it defaults to $set, which is what we want, but maybe we should verify that.
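A sketch of how the three aggregation modes in the spec (array, max, mode) could be implemented when rolling up a user-day; the property names are taken from the table above, everything else is illustrative:

```python
from collections import Counter

# Aggregation mode per property, per the spec table (abbreviated).
AGGREGATIONS = {
    "service": "array",
    "oauth_client_id": "array",
    "fxa_services_used": "array",
    "sync_device_count": "max",
    "os_name": "mode",
    "language": "mode",
}

def aggregate(values, how):
    """Collapse one user-day's raw per-event values into a single value."""
    values = [v for v in values if v is not None]
    if not values:
        return None  # property never set that day; send nothing
    if how == "array":
        return sorted(set(values))          # unique set of values
    if how == "max":
        return max(values)
    if how == "mode":
        return Counter(values).most_common(1)[0][0]  # most frequent value
    raise ValueError(f"unknown aggregation: {how}")
```

For example, `aggregate(["sync", "sync", "fx-monitor"], "array")` yields the unique set, while a mode-aggregated OS picks the value the user reported most often that day.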

Thanks for providing that list, Leif. What is currently done for os, os_version, language, and country/region? Are they currently just set as the latest value from that user?

Flags: needinfo?(loines)

Yes they are. So I suppose you're right, it doesn't make sense to send those as an array. Amplitude does the magic of pulling the correct value for the time interval of your chart. Since we are sending just one event per day now, there will be no way to establish multiple values of those properties per user. I guess that means we use mode for those. I'll edit my chart to reflect this.

Flags: needinfo?(loines)

Leif, why don't we add an array version of said fields as well? We can use the set append operation.

That works for me. Could be something along the lines of e.g. os_used_day

I've gotten confirmation from Amplitude that we can use the entire identity API capabilities with the vacuum ingestion system, so the above user props should be no issue. Here's what's left to do:

  • :jbuck to give :hwoo the HMAC key, who will make it available to Airflow jobs
  • :frank to write the export job for events and user properties
  • Amplitude needs to add that vacuum endpoint to the FxA project

It will be good to first test this on the dev FxA project.

I have sent the HMAC keys for stage and prod to :hwoo

Flags: needinfo?(jbuckley)

added to airflow vars as fxa_amplitude_hmac_secret_key_*

Draft PR for what we'd be exporting to Amplitude is available here.

I'm noticing what may be surprises, and I want to check in with the FxA folks:

Ryan:
~6.6% of users don't report any cert_signed events, but do report an access_token_checked or access_token_created. Is this expected? I was under the impression that right now all users were sending cert_signed. Are these users counted in Amplitude through some other event?

Leif:
Of the users with no cert_signed events, their user_properties are missing all fields except fxa_services_used. Any idea how we want to handle this? Should we try and get this added? For now we could send null for those properties.

Flags: needinfo?(rfkelly)
Flags: needinfo?(loines)

It's definitely NOT the case that all users will generate cert_signed. Users of Sync and maybe a small number of other services do, but the rest will generate only the oauth access_token events. For the purposes of MAU/DAU we count users who generate ANY fxa_activity - * event (there is a "derived" event within amplitude that lumps all of these together).

Not all of the user properties make sense for services that generate the access_token events. For example, if a user uses monitor and NOT sync, the sync_active_devices properties should not be set at all (this is only a property of sync users). Once cert_signed goes away, however, FxA WILL need to migrate those sync-specific user properties to be set by the access_token events. So for now, I think we should allow either event type to set the properties, but also allow the properties to be null (if we don't set the property for a given event, amplitude will continue to use the most recent value for that property, which is fine).

Flags: needinfo?(loines)

~6.6% of users don't report any cert_signed events, but do report an access_token_checked or access_token_created. Is this expected?

This sounds about right to me (assuming that it's looking at all users from all FxA-related products, many of which don't generate any cert_signed events).

Flags: needinfo?(rfkelly)

Hi all, we've successfully launched the pipeline and I am testing data in the FxAccts_Dev project. I will be loading one day of fxa_activity - active and $identify events. I've noticed the client count numbers may end up being slightly higher than what we're currently seeing in Amplitude, so we may need to take a look at which events are causing that.

In addition to the fields that Leif laid out in comment 27, we've added os_used_week and os_used_month. These are aggregated on our end, and it is straightforward to add more user properties that are aggregated in a similar way across various time periods.

Thanks so much for all your work on this Frank, here's what I'm noticing:

  1. The "official" amplitude user properties except for User ID (see attached screenshot to see what I'm referring to) are null. However I am seeing non-null values for custom event properties LANGUAGE , country , app_version (the latter looks like we should just use the official Version property). I am also seeing user_country, user_locale etc but they are all null. For user properties that we are not using array-agg on, is it possible to start using the "official" versions?

  2. I'm also seeing user properties fxa_uid, fxa_uid.data and fxa_uid.type, I'm not sure what those are (possibly some of these properties are just an artifact of your testing in which case feel free to ignore me).

  3. The aggregated os_used and sync_devices_used, fxa_services_used properties seem to be working, great!

  4. I queried the auth server logs for 2020-04-23 for COUNT(DISTINCT user_id) and got a number that was 1.08% higher than amplitude is showing. I cast the timezone to be PDT to match what amplitude uses. PM me on slack if you want the query/raw numbers. Unsure how I would follow up on this though, maybe you have ideas.

Note that I'm having the same problem with custom vs. "official" properties right now in trying to implement sync send_tab events. I'm chatting with Amplitude folks and we can hopefully apply the same solution there and here.

  1. The "official" amplitude user properties except for User ID (see attached screenshot to see what I'm referring to) are null. However I am seeing non-null values for custom event properties LANGUAGE , country , app_version (the latter looks like we should just use the official Version property). I am also seeing user_country, user_locale etc but they are all null. For user properties that we are not using array-agg on, is it possible to start using the "official" versions?

Let's see what happens with the other import, but we should be able to move those to top-level columns as we do for e.g. the Fenix import to get them available.

  2. I'm also seeing user properties fxa_uid, fxa_uid.data and fxa_uid.type, I'm not sure what those are (possibly some of these properties are just an artifact of your testing in which case feel free to ignore me).

I bet those are from some historical data in FxAccts_Dev. Do they have the associated fxa_activity - active events?

  3. The aggregated os_used and sync_devices_used, fxa_services_used properties seem to be working, great!

Great!

  4. I queried the auth server logs for 2020-04-23 for COUNT(DISTINCT user_id) and got a number that was 1.08% higher than amplitude is showing. I cast the timezone to be PDT to match what amplitude uses. PM me on slack if you want the query/raw numbers. Unsure how I would follow up on this though, maybe you have ideas.

There may be something odd going on around timestamps. I use a UTC 00:00:00 timestamp to load the data; it looks like I should be using a PDT one? That may help make the data match.

Flags: needinfo?(loines)

(In reply to Frank Bertsch [:frank] from comment #43)

I bet those are from some historical data in FxAccts_Dev. Do they have the associated fxa_activity - active events?

Ah yep, I think that's right.

There may be something odd going on around timestamps. I use a UTC 00:00:00 timestamp to load the data; it looks like I should be using a PDT one. That may help make the data match.

Actually, maybe what happened here is that you loaded the data from 2020-04-24 relative to UTC and timestamped it as 2020-04-24 00:00:00, but that ended up getting shifted to 2020-04-23 17:00:00 when displayed in Amplitude, since the FxA project is set to be relative to PDT (I wish we would change this, tbh, but I think too many people are used to it now). When I look at the numbers from the 24th relative to UTC from the server logs, I get a closer number, off by only +0.008%, which I think is good enough for government work.
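The day-shift described here is easy to reproduce with the standard library: a UTC-midnight load timestamp lands seven hours earlier when rendered in a PDT-based project. A minimal sketch, with the PDT offset hardcoded as UTC-7 for illustration (a real ETL would use a proper tz database entry like America/Los_Angeles):

```python
from datetime import datetime, timedelta, timezone

# A UTC-midnight load timestamp for 2020-04-24...
utc_load = datetime(2020, 4, 24, 0, 0, tzinfo=timezone.utc)

# ...viewed in a project configured for PDT (UTC-7 during daylight saving;
# offset hardcoded here purely for illustration).
pdt = timezone(timedelta(hours=-7))
displayed = utc_load.astimezone(pdt)

print(displayed)  # 2020-04-23 17:00:00-07:00 -- lands on the previous day
```

So a full UTC day loaded at 00:00:00 UTC shows up entirely on the prior calendar day in the PDT view, which is consistent with the mismatch seen when comparing against the server logs by UTC day.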

Flags: needinfo?(loines)

Hey all, we fixed the tz offset and the data is now loaded into the correct day. Amplitude has confirmed that we need top-level properties for their "official" user properties, so we'll update that and then run a small test against prod to confirm that user IDs match. I'm planning on sending just a few users (O(10)) to prod to verify that they already exist there. If that works, we should be good to open the gates on the new events and deprecate the cert_signed and co. events.

Still waiting on final verification from Amplitude about the version top-level field. Until then, we've also updated our ETL to use Pacific-based days rather than UTC. I'll need to backfill the dataset and then test against the FxA Dev project again. Once we're happy with those results, we'll be ready to send these events to prod.

We have updated the config and successfully added the Amplitude version property. We are ready to ingest into prod.

We've deployed the change to prod and are currently ingesting both the new fxa_activity - active event and the old events we will be replacing. We have two days of data in; an initial comparison can be found here.

Alex, Leif, I want to get sign-off from you both before we pull the plug on the cert_signed and oauth access_token events. If you have any questions or run into issues, let me know.

Flags: needinfo?(loines)
Flags: needinfo?(adavis)

Looking good to me so far. I was thinking of maybe keeping the old events through the weekend to see whether the weekend dip in DAU was substantially different from what we'd seen in the past, but maybe that's not necessary. If Alex is OK with pulling the plug on the old events earlier, then that's fine with me.

Frank, it was my understanding that FxA (:jbuck) would have to do this? Or were you going to do it on your end? Doesn't matter who does it, just want to make sure we're on the same page.

Flags: needinfo?(loines) → needinfo?(fbertsch)

Frank, it was my understanding that FxA (:jbuck) would have to do this? Or were you going to do it on your end? Doesn't matter who does it, just want to make sure we're on the same page.

You are correct, :jbuck will need to turn them off. He indicated it's fast and easy on his end, probably updating that config you pointed me to.

Flags: needinfo?(fbertsch)

Leif, if you and Alex are okay with turning off the old events sooner rather than later, we can always do future comparison analysis on the BQ data. If there is a serious issue we can also backfill.

Let's go ahead and turn off the old events. Things look good on my end, and I think they looked good to Alex yesterday.

Great. Jbuck, can you do the honors? We need to disable the "fxa_activity - cert_signed", "fxa_activity - access_token_checked", and "fxa_activity - access_token_created" events.

Flags: needinfo?(jbuckley)
Flags: needinfo?(adavis)

New filter has been applied in production: https://github.com/mozilla-services/cloudops-infra/pull/2147

Flags: needinfo?(jbuckley)

New filter has been applied in production: https://github.com/mozilla-services/cloudops-infra/pull/2147

We can close this out! New events are flowing in daily and we've cut off the old ones.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Looking at the data in Amplitude, I am a little concerned that we might not be de-duplicating these correctly: we have recorded a large number of fxa_activity - active events in the past 30 days. Reopening to investigate further.

Flags: needinfo?(fbertsch)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

(In reply to Jared Hirsch [:_6a68] [:jhirsch] (Needinfo please) from comment #59)

Looking at the data in Amplitude, I am a little concerned that we might not be de-duplicating these correctly: we have recorded a large number of fxa_activity - active events in the past 30 days. Reopening to investigate further.

The current event count looks correct for one event per user per day. Divide the total by 30 to get an approximate DAU for FxA. Let me know if I'm missing something.
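The arithmetic here can be sketched in a few lines: once events are deduplicated to one (user, day) pair per user per day, the total over a window divided by the window length approximates average DAU. The records below are entirely hypothetical; only the dedup-then-divide logic reflects the approach described above.

```python
from datetime import date

# Hypothetical raw activity records: (user_id, event date). With the new
# "fxa_activity - active" event there should be at most one per user per day.
raw_events = [
    ("u1", date(2020, 5, 1)),
    ("u1", date(2020, 5, 1)),  # duplicate within the same day, dropped
    ("u2", date(2020, 5, 1)),
    ("u1", date(2020, 5, 2)),
]

deduped = set(raw_events)      # one (user, day) pair per user per day
total_events = len(deduped)    # 3 events across the window
days_in_window = 2
approx_dau = total_events / days_in_window  # average daily active users
```

Over a 30-day window the same division (total deduplicated events / 30) gives the ~DAU figure referenced above.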

Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Flags: needinfo?(fbertsch) → needinfo?(jhirsch)
Resolution: --- → FIXED

Cool, thanks :frank!

Flags: needinfo?(jhirsch)

If it helps, here are the total events I see in Amplitude. I see the drop:
https://analytics.amplitude.com/mozilla-corp/chart/new/rtsvgi6

That is indeed helpful. Thanks, Alex!
