Last Comment Bug 384086 - Add a script and way to track update pings
: Add a script and way to track update pings
Status: RESOLVED FIXED
:
Product: addons.mozilla.org Graveyard
Classification: Graveyard
Component: Maintenance Scripts (show other bugs)
: 3.0
: All All
: -- normal
: 3.3
Assigned To: Justin Scott [:fligtar]
:
:
Mentors:
: 400647 (view as bug list)
Depends on:
Blocks: 387326
  Show dependency treegraph
 
Reported: 2007-06-11 15:59 PDT by Michael Morgan [:morgamic]
Modified: 2016-02-04 14:55 PST (History)
19 users (show)
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
v1, patch to existing download script (13.18 KB, patch)
2007-10-29 10:23 PDT, Justin Scott [:fligtar]
no flags Details | Diff | Splinter Review
v2, new classes (31.90 KB, patch)
2007-10-30 09:10 PDT, Justin Scott [:fligtar]
no flags Details | Diff | Splinter Review
sql (1006 bytes, patch)
2007-10-30 09:25 PDT, Justin Scott [:fligtar]
no flags Details | Diff | Splinter Review
sql v2 (746 bytes, text/plain)
2007-10-31 18:50 PDT, Justin Scott [:fligtar]
no flags Details
new script, v3 (35.12 KB, patch)
2007-10-31 19:01 PDT, Justin Scott [:fligtar]
no flags Details | Diff | Splinter Review
v4, minor changes (35.64 KB, patch)
2007-11-04 15:56 PST, Justin Scott [:fligtar]
morgamic: review+
Details | Diff | Splinter Review
fixes, v1 (3.19 KB, patch)
2007-11-19 13:54 PST, Justin Scott [:fligtar]
no flags Details | Diff | Splinter Review

Description Michael Morgan [:morgamic] 2007-06-11 15:59:34 PDT
We need a way to parse AMO logs to keep track of daily update pings per add-on.  This data is much more useful to add-on developers (and AMO admins).  It is more accurate and can be used to follow trends.

Storage is the main issue here.  There is no column or table in the db to track download counts per add-on over time.  It might require a new database table with an addon_id, count, and date field.

We could also use an alternative format to store the info, but using the db seems natural... especially if we can offload most of the reads onto the r/o slave in the model used to query the new table.
Comment 1 Justin Scott [:fligtar] 2007-07-06 15:33:14 PDT
This will hopefully be ready next week.
Comment 2 Michael Morgan [:morgamic] 2007-09-05 10:42:29 PDT
Andrei, can you take a look at modifying the current parser to track update pings as well?  I imagine this would have to run on a separate cron job and need (probably) another table much like the one used for download counts.
Comment 3 Michael Morgan [:morgamic] 2007-10-21 22:05:24 PDT
*** Bug 400647 has been marked as a duplicate of this bug. ***
Comment 4 Justin Scott [:fligtar] 2007-10-26 15:39:29 PDT
I'm currently trying to figure out the best way to store the data that we'd like to save for the statistics panel.

I'd like to save the add-on version, status, application, application version, and operating systems so that add-on developers better know who is using their add-ons. This would be displayed in the dev cp statistics panel, probably in pie charts that show, for example, the percentage of Firefox versions being used.

The problem is that we can't have database fields to keep track of all of these as they are variable. So I'm thinking we might have to do something like:

+-------------+------------------+------+-----+------------+----------------+
| Field       | Type             | Null | Key | Default    | Extra          |
+-------------+------------------+------+-----+------------+----------------+
| id          | int(11) unsigned |      | PRI | NULL       | auto_increment |
| addon_id    | int(11) unsigned |      | MUL | 0          |                |
| count       | int(11) unsigned |      | MUL | 0          |                |
| version     | text             | YES  |     | NULL       |                |
| status      | text             | YES  |     | NULL       |                |
| application | text             | YES  |     | NULL       |                |
| os          | text             | YES  |     | NULL       |                |
| date        | date             |      | MUL | 0000-00-00 |                |
+-------------+------------------+------+-----+------------+----------------+

and within the fields, store stuff like:
version:
{'1.0': 33, '1.5': 45, '3.0': 34522}

status:
{'userEnabled': 3444, 'userDisabled': 234, 'blocklisted': 234}

application:
{'Firefox 1.5.0.3': 2, 'Firefox 2.0.0.8': 45525, 'Thunderbird 2.0a1': 45}

os:
{'WINNT': 3455, 'MacOSX': 234}

If anyone has better ideas, please let me know. Otherwise, I'm going to continue with this format, and do a few dev CP graphs to make sure it's workable before posting the patch here.
Comment 5 Justin Scott [:fligtar] 2007-10-27 01:07:12 PDT
I finished the script tonight and it seems to work well - I used it to generate http://people.mozilla.com/~jscott/devcp/stats3.png from parsing a few minutes of logs.

I want to look over it a bit more before requesting review, and I also want to figure out if we want it to be included in the download script or use a separate script.

The version I made is a patch on the download script since I thought it would be easier to go through all the logs once since we have to for the downloads anyway. But now I'm thinking we should make a separate script so that we can run it only once a week instead of once a day. The problem with that is running it once a week will miss people that only use Firefox at work or at home or whatever the condition is that misses that day every week. But I guess doing it on Wednesdays would have to suffice.

morgamic, oremj, clouserw - thoughts on download script vs separate script?
Comment 6 Justin Scott [:fligtar] 2007-10-27 01:11:14 PDT
I forgot to mention that I'm storing the data as serialized PHP arrays in the database instead of JSON. It would just be too much perf hit to manually convert to/from JSON (we don't use PHP5 with built-in functions).
Comment 7 Wil Clouser [:clouserw] 2007-10-28 10:35:45 PDT
(In reply to comment #5)
> morgamic, oremj, clouserw - thoughts on download script vs separate script?
> 

What's the drawback to doing it in the download script?
Comment 8 Michael Morgan [:morgamic] 2007-10-28 12:32:07 PDT
I think we'd have to see what kind of db load it creates.  I can imagine that since update pings are more common we'd be adding more load than the download script by itself -- doubling whatever we already have.

On the other hand, if we're using the same method of #/day it should be just 1 additional query per add-on, which isn't too bad.

If you make them the same script you also don't have to ue separate methods to keep track of which logs were processed.

So I'd say for now try to augment the download script and join them until you see a perf reason to split them up and schedule them at completely separate times.
Comment 9 Justin Scott [:fligtar] 2007-10-29 09:56:44 PDT
(In reply to comment #7)
> What's the drawback to doing it in the download script?
> 

There are about 65 update pings for every download count. It's going create substantially more database load. In the 3 second log sample I've been testing with, there are 31 download counts and 2037 update pings. From other data, we know that there are add-ons that have several millions active users and quite a few that have over a million. It took khan-vm 8 seconds to parse and update the download counts and update pings for those 3 seconds. We'll be doing hundreds of millions of queries every day in this alone.

(In reply to comment #8)
> On the other hand, if we're using the same method of #/day it should be just 1
> additional query per add-on, which isn't too bad.

Right now, the download count script does 4 queries for every count: 1 to lookup the add-on associated with a file id (which we could avoid by adding add-on id into download urls), 1 to update addons.totaldownloads, 1 to see if there's already an entry in download_counts for that add-on and date, and 1 to either insert or update download_counts for that date and add-on.

My update ping script currently does 3 queries per ping: 1 to find the add-on associated with a GUID, 1 to get the existing values of update_counts fields, 1 to either insert or update the update_counts fields.

I can get rid of one of those by caching the list of add-ons and their GUIDs and just looking those up from an array.

And I sorta think it might be better if we just do everything in the script and only write to the DB at the end for update pings.

All of this makes me think running this script every day will be a bit much.

I will add the GUID caching and time the script again.
Comment 10 Justin Scott [:fligtar] 2007-10-29 10:23:03 PDT
Created attachment 286574 [details] [diff] [review]
v1, patch to existing download script

This is the patch of the download script, but I'm going to make a separate script that stores everything in the script until the end to see if I can get better results.
Comment 11 Justin Scott [:fligtar] 2007-10-30 09:10:15 PDT
Created attachment 286689 [details] [diff] [review]
v2, new classes

This patch creates several new classes that are used in download counting and update ping counting, as well any any future needs of log parsing. The download counter doesn't really have any big changes other than re-organization for the class structure. It can definitely be optimized further, but that's a different bug.

The update ping counter stores all counts locally until it's done parsing all logfiles and then writes the changes to the db.

I ran some benchmark tests on a 112MB logfile covering 27 minutes and 32 seconds. (Apache creates new logfiles every hour and also when those files get to be around 112MB it seems)

To parse and update the 998 downloads in the file:
new class script: Execution time: 59.507404088974
old counter: Execution time: 105.80388593674

To parse and update the 1063770 update pings for 3054 different add-ons:
new class script: Execution time: 311.66478204727 (only 4 sec of which were db writing)

To parse and update downloads AND parse and update update pings:
new class script: Execution time: 674.10633087158
new counter v1 with constant writing: Execution time: 2397.2244679928

The new script takes a new parameter for what to parse, so for most days we would just parse download counts and then probably once or twice a week, we can have it count both update pings and downloads.

morgamic, if you need logs to test with, I would suggest creating a directory in trunk/bin and copying khan-vm:/home/jscott/public_html/addons/trunk/bin/test/access_2007-10-23-19.gz to it. Then, from trunk/bin/parse_logs you can run
php -f parse_logs.php ../test /tmp [downloads|updatepings|both]

The full set of logs for that day are in khan-vm:/home/jscott/accesslogs
Comment 12 Justin Scott [:fligtar] 2007-10-30 09:25:52 PDT
Created attachment 286691 [details] [diff] [review]
sql

I forgot to attach the new table structure. Crisis averted!

I'm posting this as a patch so that you spend time figuring out whether you should patch the file and the copy and paste it, or copy and paste and try to remove every +.

:)
Comment 13 Justin Scott [:fligtar] 2007-10-30 22:12:53 PDT
Comment on attachment 286689 [details] [diff] [review]
v2, new classes

Cancelling review - I want to make a few more changes with how logs are stored in logs_parsed.
Comment 14 Michael Morgan [:morgamic] 2007-10-30 22:36:37 PDT
I'm so confused.
Comment 15 Justin Scott [:fligtar] 2007-10-31 13:10:15 PDT
So, I realized that when we run the script to count update pings, we don't want to parse all logs that we haven't already parsed, we just want a specific day (probably the day before the cron is being run).

So, I'm making changes such that if you are parsing for update pings, you have to specify what day's logs you want to parse. This also means that I'm removing the "both" option, as parsing both downloads and update pings for a specific date isn't really feasible. (The times above seem to indicate doing them both at once is actually taking longer than doing them separately.)

Also, yesterday I ran an entire day's worth of access logs for update pings only. There were 44,658,395 pings parsed and updated in 22881.478312016 seconds (a little over 6 hours). Note that that number of pings does not include pings from GUIDs not hosted on AMO, which we still have to get fairly far in the process to figure out and they are quite common, so it's really parsing a good bit more than that in that time.

I'm working on a new patch for the above issues and a few other things.
Comment 16 Justin Scott [:fligtar] 2007-10-31 18:50:40 PDT
Created attachment 286917 [details]
sql v2

These are the updated SQL changes required for the script. Depending on when you test, they may already be applied to "remora" on khan-vm.
Comment 17 Justin Scott [:fligtar] 2007-10-31 19:01:41 PDT
Created attachment 286919 [details] [diff] [review]
new script, v3

Several changes in this version:
- For update pings, counts are saved to the db after every log file instead of after all logfiles have been parsed.
- There is now a verbose flag which will display the details of every count logged - otherwise, only major events will be displayed.
- As mentioned above, there is no longer an "all" flag - it's either downloads or updatepings, and if you use updatepings, you must specify the date you wish to parse.
- Count totals will be displayed at the end of the script.
- Downloads and updatepings record which logs they've parsed separately now, so if you parse a log for downloads and want to parse the same log for updatepings, you can. After the log is parsed and counts written, you won't be able to parse for the same action again.
- Because there are several more parameters, I changed the way the parameters are passed. You can see usage information by calling the script without any parameters, but, for example, you might do something like this now:
php -f parse_logs.php logs=/data/addons.mozilla.org temp=/tmp type=updatepings date=2007-10-31 v

With updatepings, leaving the v (verbose) out will make it a bit faster as it's not outputting millions of lines.

I'm going to re-run the update ping counter tonight on a day's worth of logs.
Comment 18 James Ross 2007-11-01 07:07:26 PDT
It seems you're planning to only use pings from a single day during the week, but I don't see anyone actually saying why or that such a decision/suggestion was made, only the implications that it has.

Obviously, if it's going to take 6 hours to process a day's worth of pings, the less you have to do the better, but could someone clarify the position on this?
Comment 19 Justin Scott [:fligtar] 2007-11-01 08:51:35 PDT
Download counts are unique, which is why every time one occurs it must be counted to give a grand total of downloads. Update pings are not unique - they happen every day by the same users and can't be totaled into one grand "active users" total - active users can only display the most recent count.

Active users counts vary every day of the week even if no one new is using an extension because different people use Firefox every day. Wednesday is the day that the most people use Firefox, so I suggest running the cron on Thursdays to parse Wednesday's logs. This will provide a consistent bar for historical graphs, whereas parsing everyday would be unnecessarily expensive and not be that useful for graphing purposes except possibly a one-time view showing the usual trend over weekdays.

In other news, the latest version of the script parsed the same entire day of logs in 10920.178269148 seconds (3 hours), even with writing after every log. I think turning verbose mode off definitely helped.

Update pings counted: 44658395
Update pings skipped: 7348336
Comment 20 Justin Scott [:fligtar] 2007-11-04 15:56:13 PST
Created attachment 287340 [details] [diff] [review]
v4, minor changes

A couple minor changes:
- Removed the timezone from the strtotime calculation which was causing times after 11pm to be counted for the next day. I suspect this bug is currently affecting download counts, but it's not a big deal and will probably be fixed by today's clock changes.
- When working on stats graphs for the dashboard, I noticed a few empty records being counted, so I added a bunch of !empty() checks in various places.

There are some differences between the numbers we're getting from this script and the numbers from cbeard's little perl script, so I'm investigating the differences there. I suspect that either his script has some problems (the regex I'm seeing is pretty questionable) or there are some minor fluctuations in the usage. The most popular add-on has a difference of about 400,000.

I'm currently thinking that after this is reviewed and we get it setup in production, I'll see how hard it is for us to pull logs from backups and try to get some update ping history so that we can launch the new stats pages (bug 387326) with the UI revamp with lots of data.
Comment 21 Michael Morgan [:morgamic] 2007-11-08 17:44:50 PST
Comment on attachment 287340 [details] [diff] [review]
v4, minor changes

Works great on logs and at about 3k entries a day it should be managable.  We will have to watch the table size and break it out into another db if needed.  I'm fine w/ once a week as we talked about.  Overall, this is awesome and thanks for working on this, Justin!
Comment 22 Justin Scott [:fligtar] 2007-11-08 19:33:30 PST
Checked in.

When the site is updated next (probably Tuesday) we can run the SQL and turn this on.

Add-on developers shouldn't expect to see the results of this right away - I'm going to be working on parsing previous log files to have some historical data for the new statistics page that will hopefully be going live with AMO 3.2 next month.

Leaving this bug open until the crons are updated next week.
Comment 23 Justin Scott [:fligtar] 2007-11-15 21:50:29 PST
This script is live now, but we had to make a temporary patch to fix the production server not recognizing the usage of $_GET for some reason. Will post a patch next week with a more permanent fix.
Comment 24 Justin Scott [:fligtar] 2007-11-18 19:19:45 PST
In addition to addressing comment #23, we're getting a bunch of

PHP Notice:  MySQL Error 1062: Duplicate entry 'access_2007-11-15-23.0.gz' for key 2 in /mnt/netapp/nfs/stats/amo_stats/bin/database.class.php on line 83

This is happening because mysql_num_rows() isn't working since we switch to the database class. Fortunately MySQL isn't letting this insert go through, so it's not an emergency to fix.
Comment 25 Justin Scott [:fligtar] 2007-11-19 13:54:41 PST
Created attachment 289383 [details] [diff] [review]
fixes, v1

This fixes some notices/warnings we've been getting.
Comment 26 Justin Scott [:fligtar] 2007-11-19 15:07:30 PST
Comment on attachment 289383 [details] [diff] [review]
fixes, v1

Cancelling review - found a big issue that will be fixed in bug 404079, and I'll just roll these changes into that fix.
Comment 27 Justin Scott [:fligtar] 2007-11-19 15:09:32 PST
Considering this fixed.

Other issues should be filed as separate bugs.

The results of the script will be made accessible when the Statistics Dashboard launches (and probably on a preview site before that). Watch the webdev blog (blog.mozilla.com/webdev) for details.

Note You need to log in before you can comment on or make changes to this bug.