Bug 508973 (Closed) - Opened 15 years ago, Closed 15 years ago

[AMO] Track Paypal Stats

Categories

(addons.mozilla.org Graveyard :: Statistics, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: clouserw, Unassigned)


With the 5.0.8 launch tonight we've improved the PayPal stats gathering (URL-wise) and need some help from metrics.  As of the 5.0.8 launch there are URLs that look like this in the logs:

https://addons.mozilla.org/en-US/firefox/addons/after_contribute/$addon_id/$uuid?________________

where ________________ is a bunch of data from PayPal.  $uuid is inserted into our database when the user leaves AMO for PayPal and should be synced up when they come back.

This should be run on log files from today onward.

Steps for this bug:

1) Write a script that parses AMO logs for the format above and extracts $addon_id, $mc_gross (inside the PayPal data), and $uuid (a rough sketch follows after these steps).

2) The script should match the uuid and addon_id from the logs with an existing UUID and addon_id in the stats_contributions table.  Something like:

UPDATE stats_contributions SET amount=$mc_gross, uuid=null WHERE addon_id=$addon_id AND uuid=$uuid;

Note that this will clear out the UUID after it's been recorded.  If a UUID exists in the logs but not in the database it should be discarded.
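A minimal sketch of what steps 1 and 2 could look like, assuming Python and an Apache-style access log; the regex, helper names, and DB wiring here are illustrative assumptions, not the production setup:

import re
from urllib.parse import parse_qs

# Matches the after_contribute URL format from comment 0.  The exact
# shape of the surrounding log line is an assumption; any line that
# contains the URL will match.
AFTER_RE = re.compile(
    r"/after_contribute/(?P<addon_id>\d+)/(?P<uuid>[0-9a-f-]+)\?(?P<qs>\S+)"
)

UPDATE_SQL = (
    "UPDATE stats_contributions SET amount=%s, uuid=NULL "
    "WHERE addon_id=%s AND uuid=%s"
)

def extract(line):
    """Return (addon_id, uuid, mc_gross) for a matching log line, else None."""
    m = AFTER_RE.search(line)
    if not m:
        return None
    params = parse_qs(m.group("qs"))
    mc_gross = params.get("mc_gross", [None])[0]
    if mc_gross is None:
        return None  # after_contribute hit without the PayPal data; skip it
    return int(m.group("addon_id")), m.group("uuid"), mc_gross

# For each extracted tuple, run UPDATE_SQL with (mc_gross, addon_id, uuid).
# A UUID that exists in the logs but not in the table matches zero rows,
# which discards it exactly as the note above requires.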
1. What is the possible turnaround time between the before and after request?
2. What do you want to do with after requests that contain a uuid older than the window specified in question 1?
3. Do you want to just silently discard invalid after_contribute requests or log them for auditing purposes?
4. As a follow-up to question 3, if you clear out the uuid after the first matching after_contribute request, any subsequent after_contribute requests would be considered invalid.  Are you sure this is what you wish to do?  Is it possible for someone to change their mind about the donation amount or to contribute twice in the same session (i.e. with the same uuid)?
5. Normally, metrics only exports to the AMO master DB once per day; are you really asking us to start pushing data into the DB every hour?
6. What value is mc_gross supposed to be?  Is the currency locale-sensitive?  (e.g. could it come in as dollars for some requests and pounds or euros for other requests?)

Ultimately, this request feels a bit more like an OLTP process than an OLAP process.  I think that you would be much better served by having an actual before_contribute and after_contribute page that performed the DB logic rather than relying on parsing data out of the log files.  Of course, even if you did implement that process, I'd be happy to help with parsing out the old data from before the process is in production.
(In reply to comment #1)
> 1. What is the possible turnaround time between the before and after request?
I expect a couple of minutes - however long it takes to complete a PayPal transaction.  I guess it could be as long as it takes to log in and for a session to time out on PayPal.  I don't see how this matters, though - if it's in the logs and in our db it counts; otherwise it doesn't.

> 2. What do you want to do with after requests that contain a uuid older than
> the window specified in question 1?
I don't care about time.  UUIDs are use-once.  I don't see why we'd put in a time limit.

> 3. Do you want to just silently discard invalid after_contribute requests or
> log them for auditing purposes?
Logging for auditing could be interesting if it's easy, but I wrote the bug with discarding in mind.

> 4. As a follow-up to question 3, if you clear out the uuid after the first
> matching after_contribute request, any subsequent after_contribute requests
> would be considered invalid.  Are you sure this is what you wish to do?  Is it
> possible for someone to change their mind about the donation amount or to
> contribute twice in the same session (i.e. with the same uuid)?
I guess someone could contribute twice even without re-clicking the contribute button, unless PayPal has something built in to prevent that.  I don't think that is a case I care about.

> 5. Normally, metrics only exports to the AMO master DB once per day; are you
> really asking us to start pushing data into the DB every hour?
I don't know where this is coming from; did I say something about updating every hour?  That said, sure, I'd be happy to have updates more often.

> 6. What value is mc_gross supposed to be?  Is the currency locale-sensitive?
> (e.g. could it come in as dollars for some requests and pounds or euros for
> other requests?)
It is a float.  It is always USD.
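Since mc_gross feeds a monetary column, the log script is safer treating it as a decimal string rather than a binary float; a minimal sketch in the same vein as the parser above (the helper name is illustrative):

from decimal import Decimal, InvalidOperation

def parse_mc_gross(raw):
    # mc_gross arrives as a plain decimal string such as "5.00";
    # Decimal keeps the cents exact when amounts are stored or summed,
    # where a binary float could round them.
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None  # discard unparseable amounts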

> Ultimately, this request feels a bit more like an OLTP process than an OLAP
> process.  I think that you would be much better served by having an actual
> before_contribute and after_contribute page that performed the DB logic rather
> than relying on parsing data out of the log files.  Of course, even if you did
> implement that process, I'd be happy to help with parsing out the old data from
> before the process is in production.
We already have a before_contribute page generating the UUIDs.  In the past, doing real-time operations on the db has brought it down, so we try to do as much as possible offline via logs.  This seems like a perfect candidate to me - I don't see what an after_contribute page would do differently.
I mention the hourly vs. nightly thing because you are talking about me inserting a record into your stats table on before and an update on after.  I could keep mirror copies in a separate set of tables in my DB to avoid that.

I ask about all the timing stuff because log processing doesn't have the same sense of session state and order that you have if you perform this OLTP logic directly in the webapp.
While I typically process the logs for each hour sequentially, there is no guarantee of order: if an error occurs, the job goes back and processes the missing hour later. I also process the logs within an hour in parallel. All of this means that if the user bounces from one webserver in our cluster to another, or if a node has more than one log for that hour, or if I process the logs out of order, I might see the after request before the before request. You'd then see this as an incomplete donation.

Now, I can work around all this; it just requires a more complex process. That's why I ask all these questions: to make sure I can design that system properly.

When I was mentioning having the webapp handle this, I was thinking that the volume of donations traffic wouldn't be high enough to stress even one master server.  If the traffic were that high, though, you could always partition the writes out to your slave cluster by doing a mod partition on the addon_id. Then you'd have a federated view on the master that could be queried on a schedule to merge and insert the data into the canonical table on the master.
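A rough sketch of that partitioning idea, with a hypothetical shard count and purely illustrative names (nothing here reflects the actual AMO cluster layout):

NUM_SHARDS = 4  # hypothetical size of the slave cluster

def shard_for(addon_id):
    # Route every write for a given add-on to one fixed slave, so no
    # two shards ever hold contribution rows for the same addon_id.
    return addon_id % NUM_SHARDS

# A scheduled job on the master would then read each shard's staging
# table through the federated view and merge the rows into the
# canonical stats_contributions table.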
Blocks: 508469
Daniel, this (and #508972) is a blocker for #508469 - can we get an ETA on this?  508469 is an important feature we want to get out in the next release.
(In reply to comment #3)
> I mention the hourly vs. nightly thing because you are talking about me
> inserting a record into your stats table on before and an update on after.  I
> could keep mirror copies in a separate set of tables in my DB to avoid that.
I'm only talking about you doing an update when you run across a certain line in the log files.

> I ask about all the timing stuff because log processing doesn't have the same
> sense of session state and order that you have if you perform this OLTP logic
> directly in the webapp.
> While I typically process the logs for each hour sequentially, there is no
> guarantee of order: if an error occurs, the job goes back and processes the
> missing hour later. I also process the logs within an hour in parallel. All of
> this means that if the user bounces from one webserver in our cluster to
> another, or if a node has more than one log for that hour, or if I process the
> logs out of order, I might see the after request before the before request.
> You'd then see this as an incomplete donation.
> 
> Now, I can work around all this; it just requires a more complex process.
> That's why I ask all these questions: to make sure I can design that system
> properly.
If the before request has happened, that means the row is in our database.  You aren't looking for a before request.  You are only looking for what I mentioned in comment 0, which will only happen after a before request has happened (and the row is in our db).

> When I was mentioning having the webapp handle this, I was thinking that the
> volume of donations traffic wouldn't be high enough to stress even one master
> server.  If the traffic were that high, though, you could always partition the
> writes out to your slave cluster by doing a mod partition on the addon_id.
> Then you'd have a federated view on the master that could be queried on a
> schedule to merge and insert the data into the canonical table on the master.
There are alternatives to what I'm asking, but I don't see how that is simpler/better than what I've outlined above.
Sorry if I'm being dense, but let me just double-check that I understand correctly now:

When the user clicks the donate link, the page that generates the uuid will insert that record into the stats table itself?  So that when I am parsing through the logs and I see the after_contribute requests, I'll just be matching them up to data that was already inserted into the table by your system?

I'll follow up with Justin, but from the other bug it sounds like you have already gotten it fast tracked for me to work on starting Monday.

Once I make sure that I understand the requirements completely, I'll give an ETA.
(In reply to comment #6)
> Sorry if I'm being dense, but let me just double-check that I understand
> correctly now:
> 
> When the user clicks the donate link, the page that generates the uuid will
> insert that record into the stats table itself?  So that when I am parsing
> through the logs and I see the after_contribute requests, I'll just be matching
> them up to data that was already inserted into the table by your system?
That's correct.

> I'll follow up with Justin, but from the other bug it sounds like you have
> already gotten it fast tracked for me to work on starting Monday.
> 
> Once I make sure that I understand the requirements completely, I'll give an
> ETA.
That wfm.  I haven't discussed an ETA with anyone; I just write the bugs. :)
Will work on implementing tomorrow and Monday 2009-08-17.
Whiteboard: Due 2009-08-
The after_contribute processing required some enhancements to the overall AMO log processing, which I was testing in staging today.  An overnight test on five days of data is running tonight to verify that there are no regressions.  Barring problems, it should be ready to deploy into production on Thursday.
I've run the after_contribute processing twice now, and it just isn't finding the data properly.
There are a moderate number of requests to the after_contribute page from 2009-08-07 on, but only six of them contain an mc_gross attribute.

Let me know when you find out more information about what I should be looking for in the logs.
I modified the ETL to record a list of all the query parameter names found in either the older requests coming out of PayPal or the after_contribute requests since the 2009-08-06 update.
It is recording a large number of them.  Would you like me to e-mail a list of the data for you to review?
Sure, that would be great.
Moving this bug out of the Metrics queue for the time being.  The current tracking mechanism isn't reliable.  AMO is investigating using IPN and talking with PayPal.
Component: Data/Backend Reports → Statistics
Product: Mozilla Stats → addons.mozilla.org
QA Contact: data-reports → statistics
Whiteboard: Due 2009-08-
AMO is using IPN, thanks.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: addons.mozilla.org → addons.mozilla.org Graveyard