Bug 1474987 (Closed)
Register airflow task for sending project Savant events to Amplitude
Opened 7 years ago; closed 6 years ago
Status: RESOLVED FIXED
Component: Data Platform and Tools :: General
Type: enhancement
Priority: P1
Tracking: Not tracked
Reporter: klukas
Assignee: klukas
Attachments: 4 files
We want to start sending "project Savant" events to Amplitude. A taxonomy of events is evolving in:
https://docs.google.com/spreadsheets/d/1G_8OJGOxeWXdGJ1Ugmykk33Zsl-qAQL05CONSeD4Uz4/edit?usp=sharing
I'm working with :shong to get a finalized taxonomy into the json format that telemetry-streaming needs and will be linking PRs to this bug.
Updated • 7 years ago
Assignee: nobody → jklukas
Priority: -- → P1
Comment 1 • 7 years ago

Comment 2 (Assignee) • 7 years ago
I'm out on PTO next week, but :sunahsuh will be back and :frank is now around if the taxonomy is ready.
My understanding of the steps here is that we need to:
- Open and complete a telemetry-streaming PR containing a new json file, once we have the taxonomy
- Merge the airflow PR attached to this bug, modifying as necessary to accommodate details of the telemetry-streaming PR (the general shape of that registration is sketched after this list)
- Clean out the old apiKey files in S3
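For orientation, here is a rough sketch of what registering such a task in Airflow could look like. This is not the attached PR: the DAG id, schedule, operator choice, and command are assumptions for illustration, and the real DAG would submit the telemetry-streaming events-to-Amplitude job with the Savant JSON config.
```
# Hypothetical sketch only -- not the attached PR. All names and the command
# are placeholders for how a daily Savant-to-Amplitude task might be registered.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "jklukas",  # per the bug assignee
    "depends_on_past": False,
    "start_date": datetime(2018, 7, 1),
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

dag = DAG(
    "savant_amplitude_export",  # hypothetical DAG id
    default_args=default_args,
    schedule_interval="@daily",
)

send_savant_events = BashOperator(
    task_id="send_savant_events_to_amplitude",
    # Placeholder command; in practice this step would submit the
    # telemetry-streaming job pointed at the Savant events JSON config
    # and the Amplitude API key.
    bash_command="echo 'submit telemetry-streaming Savant job here'",
    dag=dag,
)
```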
Comment 3 (Assignee) • 6 years ago
Still waiting on a taxonomy to be able to move forward here.
Comment 4 • 6 years ago
Sorry for taking so long to write the taxonomy. I've written the taxonomy docs and created a JSON file for importing here:
https://github.com/mozilla/SAVANT-study-analysis/blob/master/savantConfigFile.json
https://github.com/mozilla/SAVANT-study-analysis
Could you review the savantConfigFile.json file to make sure I've written it correctly?
Specifically, I wasn't super sure about these points:
1) For certain `amplitudeProperties` fields, I didn't want to take a value from the telemetry event but instead directly assign it a string. In those cases, I used nested single quotes:
```
"amplitudeProperties": {
  "meta-cat-1": "'feature'",
  "meta-cat-2": "'addon'",
  "addonid": "value"
}
```
Note: `"addonid": "value"` references the `value` field in telemetry events, while `"meta-cat-1": "'feature'"` should just be the literal string 'feature'.
2) The filter logic. We should be able to just check where
```
experiments['pref-flip-savant-1457226-de-existing-users'] IS NOT NULL
OR experiments['pref-flip-savant-1457226-de-new-users'] IS NOT NULL
OR experiments['pref-flip-savant-1457226-en-existing-users'] IS NOT NULL
OR experiments['pref-flip-savant-1457226-en-new-users'] IS NOT NULL
```
in main_summary, but I'm not sure how I should set up this logic for the main ping. It's probably not a good idea to run this on all pings.
3) For the schema mapping, I used the MINIMUM mapping. For example, event A will always have a category, method, and object, but it is identifiable just by its category and method, so in that case I'm only supplying the category and method. This should be fine, right?
Comment 5 (Assignee) • 6 years ago
For (1), I don't believe there's currently support for that, but I think it should be straightforward to support. The single-quote syntax seems reasonable.
For (2), I also don't think there's currently support for this, though it might be possible to add.
For (3), I believe you're correct. If an event fails any of the specifications, that schema mapping will be skipped.
I'm going to look at implementing (1) and (2) and see if I can implement some tests. Is there somewhere I can look to see some example pings that include Savant events? That would give me a kickstart for writing good tests.
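Purely as an illustration of the two conventions being discussed (not the telemetry-streaming implementation, which is in Scala), here is a small Python sketch of how the single-quote literals from (1) and the Savant experiment filter from (2) could be interpreted; all function and field names are made up.
```
# Illustration only; names are hypothetical and this is not telemetry-streaming code.

SAVANT_EXPERIMENTS = [
    "pref-flip-savant-1457226-de-existing-users",
    "pref-flip-savant-1457226-de-new-users",
    "pref-flip-savant-1457226-en-existing-users",
    "pref-flip-savant-1457226-en-new-users",
]

def resolve_amplitude_properties(mapping, event):
    """(1): values wrapped in single quotes are literal strings; anything
    else names a field of the telemetry event."""
    resolved = {}
    for amplitude_key, source in mapping.items():
        if len(source) >= 2 and source.startswith("'") and source.endswith("'"):
            resolved[amplitude_key] = source[1:-1]       # literal, e.g. 'feature'
        else:
            resolved[amplitude_key] = event.get(source)  # e.g. the event's "value"
    return resolved

def in_savant_experiment(ping):
    """(2): keep only pings enrolled in any of the Savant experiment branches."""
    experiments = ping.get("experiments") or {}
    return any(experiments.get(exp) is not None for exp in SAVANT_EXPERIMENTS)

# Toy example:
mapping = {"meta-cat-1": "'feature'", "meta-cat-2": "'addon'", "addonid": "value"}
event = {"category": "bookmark", "method": "save", "value": "some-addon-id"}
print(resolve_amplitude_properties(mapping, event))
# {'meta-cat-1': 'feature', 'meta-cat-2': 'addon', 'addonid': 'some-addon-id'}
```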
Comment 6 • 6 years ago

Comment 7 (Assignee) • 6 years ago
Code to address (1) and (2) has been incorporated into the telemetry-streaming PR and it's now ready for review.
Comment 8 • 6 years ago
Hey klukas,
Wanted to check a couple of additional points with you:
1)
How can I get client properties into Amplitude as client-properties? This bug seems to be addressing event properties specifically.
I have a mapping of client properties available here (https://github.com/mozilla/SAVANT-study-analysis/blob/master/AMP_CLIENTS.md). Is there a JSON file or something I should provide like with events?
2)
Subsession properties: similar to client properties, but slightly different in that they don't describe the clients directly but rather their behavior during a single subsession.
* subsession length
* active ticks
* uris opened
* searches made
These are common metrics that we use to track usage, and currently there's no way for us to get that info into Amplitude.
If we wanted to answer questions like "how does current bookmark usage correlate with future URIs loaded?", we'd need this info in Amplitude in some way.
Unfortunately, Amplitude doesn't have subsession-level properties, only client properties and event properties.
3)
Non-event pings: some clients use the browser but don't have any SAVANT events in their subsession. Since the current job seems to only send SAVANT events into Amplitude, these clients won't look like they're using the browser from Amplitude's point of view, and we'll have incorrect numbers for things like retention rate (for example, the starting cohort will be undercounted, since clients who used the browser but don't have an event won't be in Amplitude).
----
I think we can address 2) and 3) in the same way, by creating an artificial daily-use Savant event with all 4 daily usage metrics (total usage hours, total active hours, total URIs, and total searches) per client per day (by rolling up each client's main pings in a day), but I'm not sure if that's the ideal solution engineering-wise.
- Su
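To make the daily-use idea in 2)/3) concrete, here is a rough sketch of the proposed rollup, assuming main-ping-derived fields such as subsession_length (seconds) and active_ticks (5-second ticks); the event shape and names are hypothetical, and this is not what ultimately shipped.
```
# Sketch of the proposal above: one artificial daily-usage event per client per
# day, rolled up from that client's main pings. Field names are assumptions.
from collections import defaultdict

def daily_usage_events(main_pings):
    totals = defaultdict(lambda: {"total_usage_hours": 0.0, "total_active_hours": 0.0,
                                  "total_uri_count": 0, "search_count": 0})
    for ping in main_pings:
        key = (ping["client_id"], ping["submission_date"])
        totals[key]["total_usage_hours"] += ping.get("subsession_length", 0) / 3600.0
        totals[key]["total_active_hours"] += ping.get("active_ticks", 0) * 5 / 3600.0
        totals[key]["total_uri_count"] += ping.get("total_uri_count", 0)
        totals[key]["search_count"] += ping.get("search_count", 0)

    for (client_id, date), metrics in totals.items():
        yield {
            "client_id": client_id,
            "submission_date": date,
            "event": "daily_use",            # artificial event name (hypothetical)
            "event_properties": metrics,
        }
```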
Comment 9 • 6 years ago

Comment 10 (Assignee) • 6 years ago
The solution for (1) is merged, and I'm currently working on a PR to ship "ping sent" events to Amplitude that include the ping-level info from (2) and will also solve for (3).
Comment 11 • 6 years ago

Comment 12 (Assignee) • 6 years ago
We're working on some final changes to the PR, but I've gone ahead and loaded the past 10 days of Savant events into dev Amplitude to prove the process. It's ready there for initial feedback.
Comment 13 (Assignee) • 6 years ago
We had some issues with loading data yesterday, but we identified the problems (the configured filters didn't match up with the particular dataset we were working with) and were able to load data. There are now more than 100k events in the dev project.
We're finishing up code review and waiting on feedback from analysts working with the dev data before integrating into Airflow and deploying to prod.
Comment 14 (Assignee) • 6 years ago
This is now running in prod airflow and we have a full backfill.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated • 2 years ago
Component: Scheduling → General