Closed Bug 1409247 Opened 7 years ago Closed 6 years ago

Export email-marketing events from SFMC, for import into Amplitude

Categories

(Marketing :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rfkelly, Unassigned)


Details

(Whiteboard: [fxa-waffle])

Ben, I'm quite sure there was a bug for this from when we did the Amplitude proof-of-concept work, but I can't find it, sorry; I would have liked to have it for reference.

We'd like to get a daily export of the email-marketing metrics events for FxA users, so that we can import them into our metrics tools such as Amplitude. IIUC there are "email sent", "email opened", and "email clicked" events tracked in SFMC, and we'd like to combine them with various other FxA activity metrics in order to understand the effect of these campaigns on acquisition, retention, etc.

The events as we'd like to track them are defined in Alex's "event taxonomy" document here:

https://docs.google.com/spreadsheets/d/1G_8OJGOxeWXdGJ1Ugmykk33Zsl-qAQL05CONSeD4Uz4/edit#gid=879531055

which defines them as having the following properties:

* type: one of "sent", "opened", or "clicked"
* uid: the FxA uid of the user in question
* job_id: (actually I don't know what this is; Alex?)
* timestamp: the timestamp at which the event was registered
* sender: one of "moco" or "mofo"
* email_id: an identifier for what type of email was sent

Ben, does this match up to your understanding of the events in SFMC? Have we missed anything that you think we should be including?

What's the best way to get these events out of SFMC and into our own processing pipeline? In my head I imagine SFMC doing a daily CSV export to an S3 bucket, and then a script on the FxA side picking up files from there and processing them, but I'm open to other suggestions.
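For illustration, a single exported record carrying those properties might look something like the sketch below. The field formats and example values here are assumptions for discussion, not a confirmed SFMC schema:

    // Hypothetical shape of one email-marketing event, per the taxonomy above.
    const exampleEvent = {
      type: 'sent',                              // 'sent', 'opened', or 'clicked'
      uid: '0123456789abcdef0123456789abcdef',   // FxA uid (made-up value)
      job_id: 12345,                             // SFMC send identifier (explained below)
      timestamp: 1508371200,                     // when the event was registered
      sender: 'moco',                            // 'moco' or 'mofo'
      email_id: 'MoCo_GLOBAL_FF_2017_NEWS_HTML_MOBILE_UPDATE_ALL_EN_EML_',
    };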
Flags: needinfo?(bniolet)
Whiteboard: [fxa-waffle]
Apologies for the long response. It got away from me.

--

I can shed some light on the event taxonomy. The "jobid" is an integer that is unique to a particular email send. It's not useful from your end other than as a unique identifier for which email was involved in the open, click or send. But the jobid is a necessary component for also sending you the email_id, which is the internal name used to identify the email. That naming convention is easily parsed so you can extract whether it was transactional or promotional, what language it was in, and the email program it was associated with (e.g. sync, re-engagement, etc.). We don't need to send you jobid, but it does provide a sure-fire way to group email sends (the same email message sent on different days will have different jobids, if that helps explain jobid).

An example email_id would come over as: MoCo_GLOBAL_FF_2017_NEWS_HTML_MOBILE_UPDATE_ALL_EN_EML_

We've been trying to standardize these so you can actually read what the email was about, who it was to, who sent it and what language it was in. The naming convention wizard and explainer is here: https://docs.google.com/a/mozilla.com/spreadsheets/d/11rvrVdF4fj5GaKOvlnLcNjnWB7U7yKV_MHmlBwRE-WA/edit?usp=sharing

--

As far as click events, we could also provide details on what link was clicked. People who click on an email will often click on more than one link, and this additional detail might be helpful. Without URL detail we wouldn't know if a click went to an unsubscribe link or the primary CTA (a big difference if we're trying to understand engagement and its impact on retention). Alternatively, we could just classify a click as on a CTA vs a preferences / unsubscribe click. That would be easy for us to figure out and label on our side before the data leaves our shop. Or we could do both (classify it and send you the URL).

--

I think we're on the same page, but just to mention, an individual FXA_ID might be associated with a send, an open and a click, or multiple opens and clicks. It would be my intention to send unique events, which means multiple opens would show up in the data and need to be digested by Amplitude.

As far as getting these out of SFMC, yes, a CSV export to an external destination is the way to go and can be done without a great deal of heavy lifting. We're already set up for Brick FTP, but I can't imagine it would be that difficult to send our file to S3. I'm adding this work to my team's upcoming sprint. What was your time frame on turning this on? This sprint? Should be doable on our end if so.
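As a rough sketch of how a consumer might pull that naming convention apart (the field positions below are guesses based on the single example above; the naming-convention wizard linked above is the authoritative source):

    // Hypothetical parser for an email_id like
    // MoCo_GLOBAL_FF_2017_NEWS_HTML_MOBILE_UPDATE_ALL_EN_EML_
    // Which underscore-delimited slot means what must come from the convention doc.
    function parseEmailId(emailId) {
      const parts = emailId.split('_').filter(Boolean);
      return {
        sender: parts[0],                   // e.g. 'MoCo' -- positional guess
        language: parts[parts.length - 2],  // e.g. 'EN'   -- positional guess
        allParts: parts,                    // keep everything for later analysis
      };
    }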
Flags: needinfo?(bniolet)
> What was your time frame on turning this on? This sprint? Should be doable on our end if so.

Sounds great! We just kicked off the two-week cycle for train-99, and this is on our list, so if it's doable on your end let's definitely prioritize it together for this sprint. I'll follow up here with some more comments later today after my email and meeting backlog.
> Alternatively, we could just classify a click as on a CTA vs a preferences / unsubscribe click. That would be easy for us to figure out and label on our side before the data leaves our shop.

For now, I think that covers the most important part.

> an individual FXA_ID might be associated with a send, an open and a click or multiple opens and clicks

That's perfect. It's good to have more detail than less.

> What was your time frame on turning this on?

Michele was hoping to have this up before the release of 57 to measure the impact of all the emails you guys will be sending out. This is what we are aiming for, which is why we have it in our current sprint.

> We're already set up for Brick FTP, but I can't imagine it would be that difficult to send our file to S3.

If this is the right way to go (or perhaps the easiest), let's do it. Note that Michele had also mentioned another tool that was used to pass email data to Beckon, but I don't know if user IDs were also passed there. Perhaps it doesn't have the same level of detail, but I just thought I'd call that out since she'd mentioned it to me.

In case your next sprint doesn't start for a few days, if we opt for dumps in FTP, perhaps a good first step in the meantime would be to share a small test file with PBooth so he can start to investigate the data format that he'll be working with. (Feel free to correct me if that wouldn't be useful or if it requires as much work as setting the whole thing up.)
The snag, as we go through our sprint planning, is that I'm on PTO for the rest of the week, so it will be difficult if not impossible for us to get a data sample before next week. Our team is committed to getting this feed done this coming sprint (which starts Monday, Oct. 23).
> We're already set up for Brick FTP, but I can't imagine it would be that difficult to send our file to S3.

I brought this up in our devops meeting earlier today. The consensus was that we *can* pick up data from BrickFTP if that's the only option, but it's far from ideal in terms of security and automation capabilities. If there's a way to get the data into something inside AWS then it'll be a significantly better setup.

Ben, is there someone specific on your side who we should coordinate with on the infrastructure details here?
Flags: needinfo?(bniolet)
While I'm on PTO this week, I recommend starting a conversation with our CRM admin, Andrew Morales, and Lisa Wright, who is filling in with Jdavis. I've copied Andrew on the bug and added an NI for him.

Friends: we want to set up the ability to do an automated file drop from SFMC to an S3 bucket provided by FxA. To set this up, we will need to enter the proper details in: Email > Admin > Data Management > File Locations. Create a new file location and hope that the configuration options let us easily send to S3.
Flags: needinfo?(bniolet) → needinfo?(amorales)
Thanks Ben! I'm adding :jbuck and :jrgm to the bug for ops perspective on the FxA side.
Sorry for the delay on this, folks. Hoping to tackle this over the next few days. I'll add an NI for Lisa W to see if she can help me connect all the dots and provide an update when I have more detailed info.
Flags: needinfo?(amorales) → needinfo?(lwright)
IIUC, we were originally going to try to prioritize this ahead of the 57 launch, so that we could observe the effect of 57-related campaign emails. I guess that ship sailed :-) Ben, what's the ongoing priority of this for your team? We're going to turn our attention to importing other kinds of events into Amplitude (e.g. sync events), but let me know whenever we're ready to pick this thread back up.
Hi Ryan. Yeah, I'd say we missed the 57 launch. I think there is still interest in getting this moving. Given some PTO I've got ahead (U.S. Thanksgiving) and then a week and then all hands and then closing for the end of the year, we have some challenges, but I'd like to still move forward as we can. What is your team's bandwidth and capability?
> What is your team's bandwidth and capability?

It's on our Q4 OKRs so I'm happy to push forward with it at any stage. I think we can probably operate fairly asynchronously between the SFMC side and FxA side as bandwidth becomes available. We have the following questions for your team:

* Can we get a small sample of the data in the format it would be exported? This will allow us to sanity-check it and to work on the data processing script on our side.
* Can we successfully set up the daily dump into S3, or do we need to look into e.g. scripting around BrickFTP?
Flags: needinfo?(lwright)
I'm working this week on getting that sample ready + figuring out the S3 situation. Are there details on where in S3 to send it, in case I can figure this out without having to get SFMC involved?
Flags: needinfo?(rfkelly)
Also here are some more questions:

1) What cadence do you want for the file drop? Daily, hourly, weekly, etc.
2) What should be on the final file: all records or all new records since the last update? I'm thinking the latter since it will cut down on the file size once we get this up and running.
Without wishing to put any words into rfk's mouth, I think I can field some of these q's for you right now:

> What cadence do you want for the file drop? Daily, hourly, weekly, etc.

If it's a choice from those three, hourly. But if more frequent is possible, then that. FxA data is being sent every minute, so the ideal scenario would be that we can see both sets of events appearing on charts in near-realtime.

It would also be helpful, if it's not too much trouble, to get a sample drop ahead of time so that we have something to code against in a dev environment soonish.

> What should be on the final file: all records or all new records since the last update?

Definitely the latter. Both for file size, as you mention, and because it's much simpler if the consuming code can just process all events in a file without worrying about maintaining a cursor or whatever. It also means it's easy to go back and re-process specific older drops in the event that some kind of error occurs at any point.

> Are there details on where in S3 to send it, in case I can figure this out without having to get SFMC involved?

As long as it's somewhere we can read from, anywhere I think. I guess that means somewhere within the cloud-services dev IAM, but you can give the bucket whatever name feels appropriate from your end.

:rfk, apologies + please correct me if anything I said there is wrong.
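To make the "no cursor" point concrete, the consuming side could treat each drop as a self-contained unit along these lines. This is only a sketch of the stateless approach (the comma-only parsing and column handling are placeholders), not the real import script:

    // Sketch: process one delta drop end-to-end, with no cursor state.
    // Re-running it on an older file simply replays that file's events.
    const fs = require('fs');

    function processDrop(path) {
      const lines = fs.readFileSync(path, 'utf8').trim().split('\n');
      const [header, ...rows] = lines;
      const columns = header.split(',');
      return rows.map(row => {
        const values = row.split(',');
        const event = {};
        columns.forEach((col, i) => { event[col] = values[i]; });
        return event; // hand each event off to whatever submits to Amplitude
      });
    }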
(In reply to Phil Booth [:pb] from comment #14)
> Without wishing to put any words into rfk's mouth, I think I can field some of these q's for you right now:
>
> > What cadence do you want for the file drop? Daily, hourly, weekly, etc.
>
> If it's a choice from those three, hourly. But if more frequent is possible, then that. FxA data is being sent every minute, so the ideal scenario would be that we can see both sets of events appearing on charts in near-realtime.

As far as cadence, I can try to see if the system will let me run it at intervals tighter than 1 hour. Might need a hack (set up three or four processes to run instead of just one).

> It would also be helpful, if it's not too much trouble, to get a sample drop ahead of time so that we have something to code against in a dev environment soonish.

I'm working on a sample file as we speak. Unless I hear differently, this will be very similar to the pilot we ran early this year: one table for all events, with FXA_ID, some metadata about the event (click, sent, open), the time, and some identifying information about the email message associated with the event.

> > What should be on the final file: all records or all new records since the last update?
>
> Definitely the latter. Both for file size, as you mention, and because it's much simpler if the consuming code can just process all events in a file without worrying about maintaining a cursor or whatever. It also means it's easy to go back and re-process specific older drops in the event that some kind of error occurs at any point.

Roger.

> > Are there details on where in S3 to send it, in case I can figure this out without having to get SFMC involved?
>
> As long as it's somewhere we can read from, anywhere I think. I guess that means somewhere within the cloud-services dev IAM, but you can give the bucket whatever name feels appropriate from your end.

Afraid your S3 answer is over my head. I'm more of a tell-me-where-exactly-to-send-it-and-don't-assume-I-know-anything kind of person.

> :rfk, apologies + please correct me if anything I said there is wrong.
> Afraid your S3 answer is over my head. I'm more of a tell-me-where-exactly-to-send-it-and-don't-assume-I-know-anything kind of person.

No probs! I'm not an expert myself, so there's an element of the blind leading the blind to all this. However, I think you need to:

1. Get access to the cloud-services dev IAM in AWS. There are instructions for that here: https://mana.mozilla.org/wiki/display/SVCOPS/Requesting+A+Dev+IAM+account+from+Cloud+Operations
2. Sign in to that account here: http://cloudservices-aws-dev.signin.aws.amazon.com/
3. Click on S3 in the list of services and create a bucket with whatever name you think is appropriate. You can copy settings from one of the other buckets listed; not sure which, but it probably doesn't matter too much.
4. When you're setting up whatever it is you're setting up, give it the details of the bucket you created. You'll also need to provide credentials of some sort, probably an access key I guess, but I'm not sure where to find that. Maybe :rfk or someone has a better idea for that bit.

Hopefully that's along the right lines. I might be wrong though! :-/
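If scripting it turns out to be easier than the console, the rough equivalent of steps 3 and 4 with the Node aws-sdk would be something like the sketch below. The bucket name, key layout and region are made up, and it assumes dev-IAM credentials are already configured locally:

    // Rough equivalent of 'create a bucket, then write drops into it'.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3({ region: 'us-west-2' }); // region is a guess

    async function createDropBucket(bucketName) {
      await s3.createBucket({ Bucket: bucketName }).promise();
      // SFMC (or a copy script) would then write the hourly/daily drops here:
      await s3.putObject({
        Bucket: bucketName,
        Key: 'sfmc-events/example-drop.csv',   // hypothetical key layout
        Body: 'type,uid,timestamp\n',          // placeholder content
      }).promise();
    }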
Ok. I'll dig in on that shortly. Thanks :pb
re: the S3 bucket, it'll probably be simpler all round if we can get :jbuck to create one with the right permissions and then just give you a pointer to it - easier to ensure security on our side, less mucking around in AWS on yours :-) I'll add him here for further comment.

Even just setting up a BrickFTP dump in the meantime, to test out the rest of the pipeline, is probably a worthwhile intermediate step if that's simpler from your side, Ben.
Flags: needinfo?(rfkelly) → needinfo?(jbuckley)
(In reply to Ryan Kelly [:rfkelly] from comment #18)
> re: the S3 bucket, it'll probably be simpler all round if we can get :jbuck to create one with the right permissions and then just give you a pointer to it - easier to ensure security on our side, less mucking around in AWS on yours :-)

Whew. That makes me feel less frightened of bringing down the whole Internet. Thanks.

> Even just setting up a BrickFTP dump in the meantime, to test out the rest of the pipeline, is probably a worthwhile intermediate step if that's simpler from your side, Ben.

I'm hoping to have a file for you on BrickFTP by the end of (my) day. May have to be tomorrow depending on any unforeseen complications.

How many records do you consider to be a "small sample"? 20K? 20M?
20K sounds OK :-) I'm out for the rest of this week (actually, technically out today but just clearing my email...) so I'll leave the other questions to :pb.

If I'm not around to hand off the sample file, Phil will need access to BrickFTP. I think you got me this by emailing someone last time, right Ben? Please do the same for Phil to get him set up.
Ok. The good news: 20k sample uploaded to BrickFTP. The bad news: you need to get your IP address whitelisted by IT to get into BrickFTP.

The file is named Email_Interactions_Sample201711230.csv

To get Brick access, we will need a login and whitelisted IP for Phil Booth. I've NI'd Vyas, who handles that for IT.
Flags: needinfo?(vswaminathan)
Thanks Ben!

> To get Brick access, we will need a login and whitelisted IP for Phil Booth. I've NI'd Vyas, who handles that for IT.

I was going to just grab this myself, but my IP address has changed in the meantime and I no longer seem to be able to access BrickFTP. Is it possible to access it from an ssh jumphost or via VPN or something, to avoid the need to allowlist specific IP addresses?
Quick update: I have the file! I'll forward to :pb via some encrypted mechanism.
Flags: needinfo?(vswaminathan)
Hi all. Happy new year. I'm pretty close, I think, to having an export automation for the data. I just need a place to send it. :jbuck do you think you'd be able to give me a place to drop files on S3? Per Comment 18 https://bugzilla.mozilla.org/show_bug.cgi?id=1409247#c18
From meeting today: :jrgm is going to take a look at this this week, or switch the ni? back to :jbuck if he's not successful at getting a bucket set up.
Flags: needinfo?(jbuckley) → needinfo?(jrgm)
Hi guys. Not pressuring anyone, just wondering if there's an update.
Sorry Ben. I apologize for my delay on this. Do you control the user 'fxa-exacttarget' in our production AWS IAM? I can set up your access to an S3 bucket with that user's credentials if so.
Flags: needinfo?(jrgm)
Flags: needinfo?(bniolet)
Hrm. I don't control that.
Flags: needinfo?(bniolet)
:pmac Do you control the user 'fxa-exacttarget' in our production AWS IAM? I can set up your access to an S3 bucket with that user's credentials if so.
Flags: needinfo?(pmac)
I do have credentials for that account. It's the one we use for another s3 bucket from which we import fxa data (re bug 1338939). That should work fine for this as well.
Flags: needinfo?(pmac)
:pmac There's a bucket s3://net-mozaws-prod-sfmc-events/ in the prod IAM that you should be able to read and write from with credentials for fxa-exacttarget. Sorry for my mental block on getting this done. Let me know if that is working for you.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Hi folks. I hate to say this, but it's looking like Salesforce isn't going to support S3 without additional integrations (from 3rd parties). Is there any chance at all that this file drop can be handled through SFTP? BrickFTP is set up and ready to go any time, if you can grab the file from it.
Status: RESOLVED → REOPENED
Flags: needinfo?(rfkelly)
Resolution: FIXED → ---
Redirecting to :jbuck, who mentioned some familiarity with BrickFTP setup in the past.
Flags: needinfo?(rfkelly) → needinfo?(jbuckley)
Yep, we can do BrickFTP. :bniolet can you create a new BrickFTP service account just for this and send the credentials to :jrgm and me?
Flags: needinfo?(jbuckley) → needinfo?(bniolet)
I don't have permission to create credentials but I just filed an IT ticket to make it so. Will advise.
Flags: needinfo?(bniolet)
IT says jbuck and jrgm should have credentials. Do you?
Flags: needinfo?(jrgm)
Flags: needinfo?(jbuckley)
I had personal credentials for my old Foundation account, but I don't want to use personal credentials that stop working once either of us leaves. Is it possible to create a BrickFTP service account?
Flags: needinfo?(jbuckley)
Email from Harish Nair:

> Surely we can open a brick ftp service account for your request with a 5-year (more or less) validity - We would still need a username (email) to be associated with that service account. Once you decide that, please hand over the email id and I will generate that service account for you guys in no time.
Flags: needinfo?(jbuckley)
One side-effect of ditching S3 at this point is that we need to re-write the script that imports the data to Amplitude. No dramas, but to try and help me get that done in time for the end of the quarter, can we answer the following questions:

1. Is everything in place for me to start testing changes to the script against BrickFTP instead of S3 right now? (By this I just mean some dummy data in the appropriate location that I can use to simulate the real thing.) If not, can we get something in place ASAP?
2. Should I re-use the pbooth@... creds for BrickFTP that I was sent previously, or do I need to request new creds? Who should I request them from?
3. Do I need to get an IP address whitelisted in order for my script to access BrickFTP? (It was suggested I would before, iirc.)

ni? bniolet, although maybe you're not the right person to ask...
Flags: needinfo?(bniolet)
I've created a new email group for jrgm and me and sent that to Harish, so that will get created soon enough.

I think the sensible thing to do is just copy the data off BrickFTP into S3, and run the script against S3 on a cron.
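A minimal sketch of that copy step, assuming an SFTP interface to BrickFTP via the ssh2-sftp-client package and the Node aws-sdk (the host, bucket and credential variables below are placeholders; the real script may look quite different):

    // Sketch: copy drop files from BrickFTP (over SFTP) into S3, run from cron.
    const SftpClient = require('ssh2-sftp-client');
    const AWS = require('aws-sdk');

    async function copyDrops() {
      const sftp = new SftpClient();
      const s3 = new AWS.S3();
      await sftp.connect({
        host: 'example.brickftp.com',            // placeholder host
        username: process.env.BRICKFTP_USER,
        password: process.env.BRICKFTP_PASSWORD,
      });
      const files = await sftp.list('/etl/deg-exacttarget');
      for (const file of files) {
        const body = await sftp.get(`/etl/deg-exacttarget/${file.name}`);
        await s3.putObject({
          Bucket: 'example-sfmc-events-bucket',  // placeholder bucket
          Key: file.name,
          Body: body,
        }).promise();
      }
      await sftp.end();
    }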
Flags: needinfo?(jbuckley)
BrickFTP does need whitelisting. Not sure how that works for this type of use, but I'm sure you guys can figure that out. I think :jbuck is onto something with his plan to just move it from Brick to S3.
Flags: needinfo?(bniolet)
> I think the sensible thing to do is just copy the data off BrickFTP into S3, and run the script against S3 on a cron

Ah yeah, that makes way more sense. Should have realised that was what you were aiming at, sorry.
Okay, I've got working BrickFTP credentials. Ben, what path do these files live at?
Flags: needinfo?(jrgm) → needinfo?(bniolet)
Right now they'd just go to Root Folder/etl/deg-exacttarget
Flags: needinfo?(bniolet)
So if that directory is acceptable, the next step is for me to send files. I have created files for each month from August through February. After those are sent, it'll be a file every hour. The timing snag, however, is that I'm going on PTO for a week and won't be able to get this started until the week of 4/26.
Flags: needinfo?(jbuckley)
Yep that directory is fine for me
Flags: needinfo?(jbuckley)
Ok guys. I'm ready to start sending files. I have the option to PGP encrypt these. Is that something we want to do? I'm a little remedial on this, but have a public key, if that's something we want. I have six months of backfill data and am re-testing the hourly feed.
Flags: needinfo?(jbuckley)
I don't think we'll need PGP for this if we're using BrickFTP
Flags: needinfo?(jbuckley)
I've been working with IT to resolve a bug we found when sending files to Brick. I am thrilled to say we've solved the issue and I've transferred three files over. This is half of the six-month backlog I've been collecting while we get this project going. Do you want to ingest those first or are we ready for me to turn on the faucet?

File path is: /etl/deg-exacttarget

Files are:
FXA_Email_Events_201708.zip
FXA_Email_Events_201709.zip
FXA_Email_Events_201710.zip
Flags: needinfo?(jbuckley)
Let's turn it on
Flags: needinfo?(jbuckley)
Hi folks. I've begun uploading large files to back-fill several months of data. I have encountered a time-out error when trying to create new files. This is a Salesforce-side problem. I'm still hopeful to have it addressed before all-hands. But if you want to start processing the old files they are named similarly to the names in comment 49.
Flags: needinfo?(jbuckley)
Daily feed is now up and running. I know we talked about hourly feeds, but currently it's not very easy to send multiple files on the same day, and I didn't want that to block this project any further than it has already been blocked.

There are one-time files that backfill email events data from August 2017 and, starting yesterday, daily files. The dailies will be styled as: FXA_Email_Eventsyymmdd.zip and FXA_Email_Events_Sentyymmdd.zip

Not every day will have a _Sent version, since we don't necessarily send email blasts every day. The other file should appear daily.
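For whoever writes the pickup side, a loose sketch of recognising those file names. The pattern below is deliberately tolerant, because the comments in this bug use both 'yymmdd' and full 'yyyymmdd' date forms, with and without an underscore before the date:

    // Sketch: classify a drop file name into events vs sent, plus its date part.
    const DROP_NAME = /^FXA_Email_Events(_Sent)?_?(\d{6,8})\.zip$/;

    function classifyDrop(name) {
      const match = DROP_NAME.exec(name);
      if (!match) return null;
      return {
        kind: match[1] ? 'sent' : 'events', // _Sent files only appear on send days
        datePart: match[2],                 // yymmdd or yyyymmdd, as provided
      };
    }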
Thanks Lauren, this sounds great! I just need to speak to jbuck now so that we can ferry them over to S3 and run the import script [1] against them.

:jbuck, I'm not going to be in SF unfortunately, so maybe after you're back we can get together on vidyo to push this over the finish line?

[1] https://github.com/mozilla/fxa-amplitude-send/pull/38/files
I've got a Node.js script that can pull down files from BrickFTP using the fxa-brickftp credentials. From an operational POV, I need to figure out how best to store the "file processed" flag for each file. I've submitted some data to the Amplitude FxA Stage project using Phil's PR, with some fixes for the data parsing that I've pushed up.

Some questions for you Lauren:

1) Will the data on BrickFTP ever expire, or will it stick around until someone deletes it manually?
2) Do you know what timezone the data is in?
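On the "file processed" flag, one lightweight option (purely a sketch; the bucket name and processed/ prefix are examples, and the real solution may differ) is a zero-byte marker object per imported file:

    // Sketch: record processed files as marker objects in S3 so that re-runs
    // can skip anything already imported.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();
    const BUCKET = 'example-sfmc-events-bucket'; // placeholder bucket

    async function alreadyProcessed(fileName) {
      try {
        await s3.headObject({ Bucket: BUCKET, Key: `processed/${fileName}` }).promise();
        return true;
      } catch (err) {
        if (err.code === 'NotFound') return false;
        throw err;
      }
    }

    async function markProcessed(fileName) {
      await s3.putObject({ Bucket: BUCKET, Key: `processed/${fileName}`, Body: '' }).promise();
    }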
Flags: needinfo?(jbuckley)
Timezones should be PT. Data will live on Brick until it's deleted.
I've imported all files from today all the way back to FXA_Email_Events_20180518.zip. I'm going to let the import run over the weekend from FXA_Email_Events_201805.zip to FXA_Email_Events_201710.zip
I have imported all files from 2017-10 to 2018-07-04, and have a script to sync data to S3 working now: https://github.com/mozilla/fxa-brickftp-copy/
Friends, reviewing the data that's appearing in Amplitude, it looks like some dates are missing, and I'm fairly comfortable saying that's likely due to an error on my end. If I re-ran, say, the month of February, would the script be able to parse out dupes? There isn't a single field that acts as a primary key per se, but if you combine FXA_ID, Job_ID, EventType and EventDate, all records should be unique.
Adding NI for jbuck ^
Flags: needinfo?(jbuckley)
This can be done. Amplitude supports deduping based on an insert_id, which we've already set to be a hash of those fields you mentioned. So if we were to re-import a regenerated file, deduping should happen automagically via the insert_id. But there is a hitch. Amplitude only allows a 7-day window in which events can be resubmitted. After 7 days have passed from the time an event was submitted, submitting the same event again will show up as a duplicate. I'll leave jbuck's NI in place so he can chime in with when February's events were submitted and whether we have any window left.
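For reference, an insert_id along the lines pb describes could be derived roughly like this (the exact field order, separator and hash the real script uses may differ; this just shows the idea):

    // Sketch: derive a stable insert_id for Amplitude dedup from the four
    // fields that together uniquely identify a record.
    const crypto = require('crypto');

    function insertId(record) {
      return crypto
        .createHash('sha256')
        .update([record.FXA_ID, record.Job_ID, record.EventType, record.EventDate].join('|'))
        .digest('hex');
    }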
We could also de-dupe manually of course, by running the script over the before and after files and submitting the delta. A little bit of legwork involved to modify the script, but definitely not an insurmountable problem.
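And the manual route could be as simple as diffing on those ids before submitting anything; a sketch, reusing the insertId() helper from the previous comment:

    // Sketch: given parsed rows from the original and regenerated February files,
    // keep only the rows whose insert_id was not in the original submission.
    function newRowsOnly(originalRows, regeneratedRows) {
      const seen = new Set(originalRows.map(insertId));
      return regeneratedRows.filter(row => !seen.has(insertId(row)));
    }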
Lauren: Yeah, I think we can dedupe based on the combination of those 4 fields, comparing to the existing February file. As :pb said, Amplitude only maintains a rolling 7-day window, and I started submitting data on Friday last week. Let us know when the new February file has been uploaded.
Flags: needinfo?(jbuckley)
This is working in production, running on an fxa-admin instance
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → FIXED
It took a while, but I just resent the February files. Hope it's not too late to be useful. They are: FXA_Email_Events_022018.zip and FXA_Email_Events_Sent022018.zip
Status: RESOLVED → REOPENED
Flags: needinfo?(jbuckley)
Resolution: FIXED → ---
And they've both been processed now
Flags: needinfo?(jbuckley)

Closing this loop. It's fixed.

Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED