Open Bug 1136014 Opened 6 years ago Updated 6 years ago

Sync monthly active users for v1 executive dashboard

Categories

(Cloud Services :: Metrics: Dashboard, defect, P2)

x86
macOS
defect

Tracking

(Not tracked)

People

(Reporter: kparlante, Assigned: kparlante, NeedInfo)

References

Details

Attachments

(2 files)

From our new data dictionary:

1. A (sync) user is a FxA account holder who has at least one Firefox profile connected to Sync
2. A monthly active user is a user (desktop or mobile) that has synced >0 items in the last 28 days

For the executive dashboard, we'd like to measure "active sync users" directly (2), as described above. It would be great if we could make that query regularly, and then feed the daily results to the shared heka instance.

(We have a similar example of this for FxA accounts & verified accounts; a query runs on the FxA Admin server: https://github.com/mozilla-services/puppet-config/pull/489)

If we can get (1) as well, that helps compute a % of active users.

It won't make it on the executive dashboard, but we'd also like to know:
(3) Distribution of number of devices connected to SYNC_ACTIVE_USER
(4) Among profiles with 2+ devices, how many have at least desktop and at least one mobile
IIUC we have a "node admin box" where we could run metrics queries against the tokenserver DB.

For (1) do we care about attrition, or is it "every FxA user who has ever connected to sync"?  Measuring that should be a fairly simple COUNT on the tokenserver users table.  I can work up something using the linked FxA code as an example.

For (2) is this the thing that's done in FxA using HyperLogLog on individual requests?  I don't *think* we can synthesize this based on data from the tokenserver db.  I suspect we'll have to emit a custom metric on the tokenserver side saying "FxA user BLAH just requested a token" and accumulate/count those downstream in heka.

Katie, does that sound reasonable?

We have a pending tokenserver push in Bug 1129346. If you want these urgently then we could derail that and try to piggyback the extra metrics on that deployment.
For (3) we could probably run some periodic metrics gathering on each storage node and send them out for heka to accumulate.

I'm not sure how to approach (4), since IIRC we can't really tell device type from the encrypted contents of the DB.  Perhaps we could include user-agent information in the metrics from (2) and calculate the counts downstream.
(In reply to Ryan Kelly [:rfkelly] from comment #1)
> For (1) do we care about attrition, or is it "every FxA user who has ever
> connected to sync"?  Measuring that should be a fairly simple COUNT on the
> tokenserver users table.  I can work up something using the linked FxA code
> as an example.
We should check with Saptarshi, but I think "every FxA user who has ever connected" is good enough for this pass. I suspect ultimately we'll change to "FxA user who is also an active Fx user", but that requires measuring from FHR profiles.

> For (2) is this the thing that's done in FxA using HyperLogLog on individual
> requests?  I don't *think* we can synthesize this based on data from the
> tokenserver db.  I suspect we'll have to emit a custom metric on the
> tokenserver side saying "FxA user BLAH just requested a token" and
> accumulate/count those downstream in heka.

Yes, right now we're computing "active user" by looking at cert/sign (which is effectively measuring sync as everyone else uses oauth). I think we want a direct measurement of sync. Yes, emitting a custom metric from tokenserver just as you describe, and then doing the HyperLogLog thing downstream in Heka sounds like a good solution.

> We have a pending tokenserver push in Bug 1129346. If you want these
> urgently then we could derail that and try to piggyback the extra metrics on
> that deployment.

Apologies for the last-minute request, and I promise I won't do it often, but yes, let's piggyback the extra metrics on that deployment.
(In reply to Ryan Kelly [:rfkelly] from comment #2)
> For (3) we could probably run some periodic metrics gathering on each
> storage node and send them out for heka to accumulate.

Sounds reasonable.

> I'm not sure how to approach (4), since IIRC we can't really tell device
> type from the encrypted contents of the DB.  Perhaps we could include
> user-agent information in the metrics from (2) and calculate the counts
> downstream.

+1 to including user-agent information in the metrics

(3) and (4) don't have the same time pressure btw (though it seems like a good idea to go ahead with the user agent information). The visualization for (3) and (4) should go on sync-dashboard when it's ready.
So it sounds like the minimal things we need to make this happen are:

* A script to run periodically on either the node-admin box, or on the tokenserver webheads, that does basically a "SELECT COUNT(*) FROM users WHERE replaced_at IS NULL" on the tokenserver users table.  This will give us a pretty accurate picture of (1), the total number of users who have ever talked to sync.

* Add the FxA user identifier (or a derivative of it) to the existing request.summary log for the tokenserver.  This will give you enough information to pull out the data you want downstream of the app (since there's really only a single API endpoint on the tokenserver, anyone who hits this endpoint can be considered an active sync user)
Katie, assuming this stuff is already flowing somewhere you can see it, can you take a look at the existing request-summary logs from tokenserver and see if adding the uid on there would fit your needs?  I'm thinking you would just do a HLL count on uid for this log stream.
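For illustration only, here is a minimal Python sketch of what an "HLL count on uid" could look like. It is not the Heka filter itself: the class name `HyperLogLog`, the precision parameter `p`, and the use of SHA-256 as the hash function are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter (hypothetical sketch, not Heka code)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def add(self, uid):
        # Take a 64-bit hash of the (already obfuscated) user identifier.
        digest = hashlib.sha256(uid.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate
```

Because sketches merge by register-wise maximum, one plausible way to get a 28-day figure would be to keep one sketch per day and merge the most recent 28, though that windowing scheme is also an assumption here.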
(In reply to Ryan Kelly [:rfkelly] from comment #6)
> Katie, assuming this stuff is already flowing somewhere you can see it, can
> you take a look at the existing request-summary logs from tokenserver and
> see if adding the uid on there would fit your needs?  I'm thinking you would
> just do a HLL count on uid for this log stream.

I'm not seeing the request.summary logs on the tokenserver-app-logs index (just requests.packages.urllib3.connectionpool pings). Looks like they're being filtered out currently:
https://github.com/mozilla-services/puppet-config/blob/master/shared/modules/shared/templates/hekad/elasticsearch.toml.erb

Presumably that can be remedied easily. Perhaps it got filtered out because timing information was being logged more frequently?

Adding a unique identifier to what is logged here would be perfect:
https://github.com/mozilla-services/mozservices/blob/master/mozsvc/metrics.py
whd tells me the request.summary data was filtered out because it basically duplicated the nginx logging. We can turn it back on in the next few days.
(In reply to Ryan Kelly [:rfkelly] from comment #5)
> So it sounds like the minimal things we need to make this happen are:
> 
> * A script to run periodically on either the node-admin box, or on the
> tokenserver webheads, that does basically a "SELECT COUNT(*) FROM users
> WHERE replaced_at IS NULL" on the tokenserver users table.  This will give
> us a pretty accurate picture of (1), the total number of users who have ever
> talked to sync.

There's a job running on the node manager that counts the following and submits them as a custom CloudWatch metric: https://github.com/mozilla-services/puppet-config/blob/master/sync_1_5/modules/node_manager/templates/node_alloc.pl.erb

In StackDriver we have six weeks of archived one minute counts for the available, capacity, and current_load columns.  We also have a special column, node-capacity, which is percentage of capacity used for sending alerts.

It wouldn't be that hard to add something to upload a json document to S3 for these metrics.
Here's a quick attempt to satisfy (2) by including a unique user identifier in the request-summary log.  I went with HMAC_SHA256(key, uid) with the secret key just set as a config option on the server.  This seems like it should be good enough for now.

(Tarek, the background here is that we want to be able to tie together server-side metrics for FxA users across all their different services, but we don't want to make it trivial to link this back to actual user accounts.  So we HMAC as an obfuscation with a key that's only known to limited services).
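As a rough sketch of the HMAC_SHA256(key, uid) obfuscation described above, using only the Python standard library. The function name `obfuscate_uid` and the example key and uid values are hypothetical and are not taken from the attached patch.

```python
import hashlib
import hmac


def obfuscate_uid(secret_key: bytes, uid: str) -> str:
    """Return a stable, keyed digest of a user id for use in metrics logs.

    The same (key, uid) pair always yields the same value, so downstream
    counting still works, but the uid cannot be recomputed without the key.
    """
    return hmac.new(secret_key, uid.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical usage: the key would come from server config, not source code.
example = obfuscate_uid(b"example-secret-from-config", "user@example.com")
```

Any service that shares the same key produces the same digest for the same uid, which is what would allow metrics to be tied together across services.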
Attachment #8568900 - Flags: review?(tarek)
(In reply to Ryan Kelly [:rfkelly] from comment #2)
> For (3) we could probably run some periodic metrics gathering on each
> storage node and send them out for heka to accumulate.

If we do that, I'd prefer not to add too much load to the individual database servers.  Would it be acceptable to sample this data instead of counting everything?  If so, what sample size would be required?

> I'm not sure how to approach (4), since IIRC we can't really tell device
> type from the encrypted contents of the DB.  Perhaps we could include
> user-agent information in the metrics from (2) and calculate the counts
> downstream.

Heka has access to this data, and we have the UA pie charts in some of our dashboards.  I suspect this might require a relatively complicated, and memory hungry, heka filter.  :whd thoughts?
(In reply to Katie Parlante from comment #0)
> From our new data dictionary:
 
> It won't make it on the executive dashboard, but we'd also like to know:
> (3) Distribution of number of devices connected to SYNC_ACTIVE_USER
> (4) Among profiles with 2+ devices, how many have at least desktop and at
> least one mobile

It might be nice to roll up Bug 1130044 type metrics, average collection size, and total number of items in Sync as well.
(In reply to Bob Micheletto [:bobm] from comment #11)

> Heka has access to this data, and we have the UA pie charts in some of our
> dashboards.  I suspect this might require a relatively complicated, and
> memory hungry, heka filter.  :whd thoughts?

It's probably doable with two (large) bloom filters of uids. When processing a message, do membership checking of the correct bloom based on UA, and if it's not there, add it. Then check if it's in the other bloom and if so increment the count (if it's in both, do nothing).
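A minimal Python sketch of the two-bloom-filter idea, for illustration. A real version would be a Heka filter; the class names, filter sizes, and the assumption that each message already carries a `device_class` of "desktop" or "mobile" (derived from the UA upstream) are all made up for this example.

```python
import hashlib


class BloomFilter:
    """Simple bloom filter over strings (hypothetical sizing)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (i, item)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class MultiDeviceCounter:
    """Counts uids seen on both desktop and mobile, each uid at most once."""

    def __init__(self):
        self.seen = {"desktop": BloomFilter(), "mobile": BloomFilter()}
        self.both = 0

    def observe(self, uid, device_class):
        own = self.seen[device_class]
        other = self.seen["mobile" if device_class == "desktop" else "desktop"]
        if uid in own:
            return              # already recorded for this device class
        own.add(uid)
        if uid in other:
            self.both += 1      # first time this uid shows up on both sides
```

Bloom filters can report false positives, so the resulting count is approximate, and the filters would need to be reset or rotated to bound the measurement window.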
Part 2 is a script that will log the total user count.  I've done it in the style of the existing FxA script from https://github.com/mozilla-services/puppet-config/pull/489/files#diff-0f4f298bc66040ed6d31688c95b4c4c8R49, including timezone logic and JSON format.

The idea would be to just run this regularly from cron and slurp it into heka like we do for the above-linked FxA script.
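For readers without access to the attachment, a rough sketch of what such a script might look like follows. It is not the attached patch: it uses an in-memory SQLite database as a stand-in for the tokenserver DB, and the JSON field names (`op`, `total_users`, `timestamp`) are hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Query quoted earlier in this bug; replaced_at IS NULL skips superseded rows.
USER_COUNT_QUERY = "SELECT COUNT(*) FROM users WHERE replaced_at IS NULL"


def count_users(conn):
    """Return the number of current (non-replaced) user records."""
    return conn.execute(USER_COUNT_QUERY).fetchone()[0]


def build_metric(conn, now=None):
    """Build one JSON log line suitable for a log shipper to pick up."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "op": "user_count",             # hypothetical metric name
        "timestamp": now.isoformat(),   # explicit UTC timestamp
        "total_users": count_users(conn),
    }, sort_keys=True)


def make_demo_db():
    """Create a tiny stand-in users table for demonstration purposes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY, replaced_at INTEGER)")
    conn.executemany("INSERT INTO users (uid, replaced_at) VALUES (?, ?)",
                     [(1, None), (2, 1424000000), (3, None)])
    return conn


if __name__ == "__main__":
    print(build_metric(make_demo_db()))
```

Run from cron, a script of this shape would print one JSON line per invocation.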

Tarek, if these look OK to you then we should be able to prep this for deploy sometime tomorrow.  How quickly we can get it out and have the data flowing after that, I'm not sure.

Katie, there will obviously still be that 28-day lag on the monthly active user count.  I don't think it would be *too* much of a stretch to use the FxA monthly-active count to fill in for the stat in the meantime, if you need it.  It would over-report slightly but be in the right ballpark.

I'll be in a training session all day tomorrow but will try to check in on this at some point.
Attachment #8569004 - Flags: review?(tarek)
Comment on attachment 8569004 [details] [diff] [review]
script to log user-count metric

Rfk asked Tarek or me for quick feedback on IRC.

As I think Tarek is out for the rest of this week, I took a quick look at it.

I'm not familiar with the code base but I don't see anything weird in there so I'm okay to get this marked off with my name.

r=ametaireau
Attachment #8569004 - Flags: review?(tarek) → review+
Depends on: 1136895
I filed Bug 1136895 to get the new metrics out into stage and prod.
(In reply to Bob Micheletto [:bobm] from comment #11)
> (In reply to Ryan Kelly [:rfkelly] from comment #2)
> > For (3) we could probably run some periodic metrics gathering on each
> > storage node and send them out for heka to accumulate.
> 
> If we do that, I'd prefer not to add too much load to the individual
> database servers.  Would it be acceptable to sample this data instead of
> counting everything?  If so, what sample size would be required?

Saptarshi is the person to give us feedback on sample size (for "Distribution of number of devices connected to SYNC_ACTIVE_USER" use case).
Flags: needinfo?(sguha)
(In reply to Bob Micheletto [:bobm] from comment #12)
> It might be nice to roll up Bug 1130044 type metrics, average collection
> size, and total number of items in Sync as well.

Agreed. rfkelly and ckarlof are working on a new ping for fxa reliers (sync being the first one): these fields sound like good candidates.
(In reply to Ryan Kelly [:rfkelly] from comment #14)
> Katie, there will obviously still be that 28-day lag on the monthly active
> user count.  I don't think it would be *too* much of a stretch to use the FxA
> monthly-active count to fill in for the stat in the meantime, if you need
> it.  It would over-report slightly but be in the right ballpark.

Good point, thanks for the heads up. I'll talk to jjensen about what he wants, but that should be fine.
 
> I'll be in a training session all day tomorrow but will try to check in on
> this at some point.

Thank you so much for getting this in!
Attachment #8568900 - Flags: review?(tarek) → review+
Depends on: 1144390
No longer depends on: 1144390
With the prod deploy of tokenserver (bug #1143906) we're receiving metrics from the app, and should have sync MAU for the dashboard in 30 days (2015-04-17).

https://heka.shared.us-west-2.prod.mozaws.net/#plugins/filters/Sync-1_5-ActiveUsers30Days
Prioritizing "gap" bugs as P2.
No longer blocks: 1135847
Priority: -- → P2
Taking this from Ryan, I'll add to dashboard after 4/17
Assignee: rfkelly → kparlante
Component: Operations: Metrics/Monitoring → Metrics: Dashboard