Closed Bug 384086 Opened 17 years ago Closed 17 years ago

Add a script and way to track update pings

Categories

(addons.mozilla.org Graveyard :: Maintenance Scripts, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: morgamic, Assigned: fligtar)

References

Details

Attachments

(3 files, 4 obsolete files)

We need a way to parse AMO logs to keep track of daily update pings per add-on.  This data is much more useful to add-on developers (and AMO admins).  It is more accurate and can be used to follow trends.

Storage is the main issue here.  There is no column or table in the db to track download counts per add-on over time.  It might require a new database table with an addon_id, count, and date field.

We could also use an alternative format to store the info, but using the db seems natural... especially if we can offload most of the reads onto the r/o slave in the model used to query the new table.
Assignee: nobody → fligtar
This will hopefully be ready next week.
Status: NEW → ASSIGNED
Blocks: 387326
Assignee: fligtar → nobody
Status: ASSIGNED → NEW
Andrei, can you take a look at modifying the current parser to track update pings as well?  I imagine this would have to run on a separate cron job and need (probably) another table much like the one used for download counts.
Assignee: nobody → sancus
Assignee: sancus → susanalaura
Assignee: susanalaura → lthomson
Target Milestone: 3.1 → 3.3
Assignee: laura → fligtar
I'm currently trying to figure out the best way to store the data that we'd like to save for the statistics panel.

I'd like to save the add-on version, status, application, application version, and operating systems so that add-on developers better know who is using their add-ons. This would be displayed in the dev cp statistics panel, probably in pie charts that show, for example, the percentage of Firefox versions being used.

The problem is that we can't have database fields to keep track of all of these as they are variable. So I'm thinking we might have to do something like:

+-------------+------------------+------+-----+------------+----------------+
| Field       | Type             | Null | Key | Default    | Extra          |
+-------------+------------------+------+-----+------------+----------------+
| id          | int(11) unsigned |      | PRI | NULL       | auto_increment |
| addon_id    | int(11) unsigned |      | MUL | 0          |                |
| count       | int(11) unsigned |      | MUL | 0          |                |
| version     | text             | YES  |     | NULL       |                |
| status      | text             | YES  |     | NULL       |                |
| application | text             | YES  |     | NULL       |                |
| os          | text             | YES  |     | NULL       |                |
| date        | date             |      | MUL | 0000-00-00 |                |
+-------------+------------------+------+-----+------------+----------------+

and within the fields, store stuff like:
version:
{'1.0': 33, '1.5': 45, '3.0': 34522}

status:
{'userEnabled': 3444, 'userDisabled': 234, 'blocklisted': 234}

application:
{'Firefox 1.5.0.3': 2, 'Firefox 2.0.0.8': 45525, 'Thunderbird 2.0a1': 45}

os:
{'WINNT': 3455, 'MacOSX': 234}

If anyone has better ideas, please let me know. Otherwise, I'm going to continue with this format, and do a few dev CP graphs to make sure it's workable before posting the patch here.
Status: NEW → ASSIGNED
I finished the script tonight and it seems to work well - I used it to generate http://people.mozilla.com/~jscott/devcp/stats3.png from parsing a few minutes of logs.

I want to look over it a bit more before requesting review, and I also want to figure out if we want it to be included in the download script or use a separate script.

The version I made is a patch on the download script since I thought it would be easier to go through all the logs once since we have to for the downloads anyway. But now I'm thinking we should make a separate script so that we can run it only once a week instead of once a day. The problem with that is running it once a week will miss people that only use Firefox at work or at home or whatever the condition is that misses that day every week. But I guess doing it on Wednesdays would have to suffice.

morgamic, oremj, clouserw - thoughts on download script vs separate script?
I forgot to mention that I'm storing the data as serialized PHP arrays in the database instead of JSON. It would just be too much perf hit to manually convert to/from JSON (we don't use PHP5 with built-in functions).
(In reply to comment #5)
> morgamic, oremj, clouserw - thoughts on download script vs separate script?
> 

What's the drawback to doing it in the download script?
I think we'd have to see what kind of db load it creates.  I can imagine that since update pings are more common we'd be adding more load than the download script by itself -- doubling whatever we already have.

On the other hand, if we're using the same method of #/day it should be just 1 additional query per add-on, which isn't too bad.

If you make them the same script you also don't have to ue separate methods to keep track of which logs were processed.

So I'd say for now try to augment the download script and join them until you see a perf reason to split them up and schedule them at completely separate times.
(In reply to comment #7)
> What's the drawback to doing it in the download script?
> 

There are about 65 update pings for every download count. It's going create substantially more database load. In the 3 second log sample I've been testing with, there are 31 download counts and 2037 update pings. From other data, we know that there are add-ons that have several millions active users and quite a few that have over a million. It took khan-vm 8 seconds to parse and update the download counts and update pings for those 3 seconds. We'll be doing hundreds of millions of queries every day in this alone.

(In reply to comment #8)
> On the other hand, if we're using the same method of #/day it should be just 1
> additional query per add-on, which isn't too bad.

Right now, the download count script does 4 queries for every count: 1 to lookup the add-on associated with a file id (which we could avoid by adding add-on id into download urls), 1 to update addons.totaldownloads, 1 to see if there's already an entry in download_counts for that add-on and date, and 1 to either insert or update download_counts for that date and add-on.

My update ping script currently does 3 queries per ping: 1 to find the add-on associated with a GUID, 1 to get the existing values of update_counts fields, 1 to either insert or update the update_counts fields.

I can get rid of one of those by caching the list of add-ons and their GUIDs and just looking those up from an array.

And I sorta think it might be better if we just do everything in the script and only write to the DB at the end for update pings.

All of this makes me think running this script every day will be a bit much.

I will add the GUID caching and time the script again.
This is the patch of the download script, but I'm going to make a separate script that stores everything in the script until the end to see if I can get better results.
Attached patch v2, new classes (obsolete) — Splinter Review
This patch creates several new classes that are used in download counting and update ping counting, as well any any future needs of log parsing. The download counter doesn't really have any big changes other than re-organization for the class structure. It can definitely be optimized further, but that's a different bug.

The update ping counter stores all counts locally until it's done parsing all logfiles and then writes the changes to the db.

I ran some benchmark tests on a 112MB logfile covering 27 minutes and 32 seconds. (Apache creates new logfiles every hour and also when those files get to be around 112MB it seems)

To parse and update the 998 downloads in the file:
new class script: Execution time: 59.507404088974
old counter: Execution time: 105.80388593674

To parse and update the 1063770 update pings for 3054 different add-ons:
new class script: Execution time: 311.66478204727 (only 4 sec of which were db writing)

To parse and update downloads AND parse and update update pings:
new class script: Execution time: 674.10633087158
new counter v1 with constant writing: Execution time: 2397.2244679928

The new script takes a new parameter for what to parse, so for most days we would just parse download counts and then probably once or twice a week, we can have it count both update pings and downloads.

morgamic, if you need logs to test with, I would suggest creating a directory in trunk/bin and copying khan-vm:/home/jscott/public_html/addons/trunk/bin/test/access_2007-10-23-19.gz to it. Then, from trunk/bin/parse_logs you can run
php -f parse_logs.php ../test /tmp [downloads|updatepings|both]

The full set of logs for that day are in khan-vm:/home/jscott/accesslogs
Attachment #286574 - Attachment is obsolete: true
Attachment #286689 - Flags: review?(morgamic)
Attached patch sql (obsolete) — Splinter Review
I forgot to attach the new table structure. Crisis averted!

I'm posting this as a patch so that you spend time figuring out whether you should patch the file and the copy and paste it, or copy and paste and try to remove every +.

:)
Comment on attachment 286689 [details] [diff] [review]
v2, new classes

Cancelling review - I want to make a few more changes with how logs are stored in logs_parsed.
Attachment #286689 - Flags: review?(morgamic)
I'm so confused.
So, I realized that when we run the script to count update pings, we don't want to parse all logs that we haven't already parsed, we just want a specific day (probably the day before the cron is being run).

So, I'm making changes such that if you are parsing for update pings, you have to specify what day's logs you want to parse. This also means that I'm removing the "both" option, as parsing both downloads and update pings for a specific date isn't really feasible. (The times above seem to indicate doing them both at once is actually taking longer than doing them separately.)

Also, yesterday I ran an entire day's worth of access logs for update pings only. There were 44,658,395 pings parsed and updated in 22881.478312016 seconds (a little over 6 hours). Note that that number of pings does not include pings from GUIDs not hosted on AMO, which we still have to get fairly far in the process to figure out and they are quite common, so it's really parsing a good bit more than that in that time.

I'm working on a new patch for the above issues and a few other things.
Attached file sql v2
These are the updated SQL changes required for the script. Depending on when you test, they may already be applied to "remora" on khan-vm.
Attachment #286691 - Attachment is obsolete: true
Attached patch new script, v3 (obsolete) — Splinter Review
Several changes in this version:
- For update pings, counts are saved to the db after every log file instead of after all logfiles have been parsed.
- There is now a verbose flag which will display the details of every count logged - otherwise, only major events will be displayed.
- As mentioned above, there is no longer an "all" flag - it's either downloads or updatepings, and if you use updatepings, you must specify the date you wish to parse.
- Count totals will be displayed at the end of the script.
- Downloads and updatepings record which logs they've parsed separately now, so if you parse a log for downloads and want to parse the same log for updatepings, you can. After the log is parsed and counts written, you won't be able to parse for the same action again.
- Because there are several more parameters, I changed the way the parameters are passed. You can see usage information by calling the script without any parameters, but, for example, you might do something like this now:
php -f parse_logs.php logs=/data/addons.mozilla.org temp=/tmp type=updatepings date=2007-10-31 v

With updatepings, leaving the v (verbose) out will make it a bit faster as it's not outputting millions of lines.

I'm going to re-run the update ping counter tonight on a day's worth of logs.
Attachment #286689 - Attachment is obsolete: true
Attachment #286919 - Flags: review?(morgamic)
It seems you're planning to only use pings from a single day during the week, but I don't see anyone actually saying why or that such a decision/suggestion was made, only the implications that it has.

Obviously, if it's going to take 6 hours to process a day's worth of pings, the less you have to do the better, but could someone clarify the position on this?
Download counts are unique, which is why every time one occurs it must be counted to give a grand total of downloads. Update pings are not unique - they happen every day by the same users and can't be totaled into one grand "active users" total - active users can only display the most recent count.

Active users counts vary every day of the week even if no one new is using an extension because different people use Firefox every day. Wednesday is the day that the most people use Firefox, so I suggest running the cron on Thursdays to parse Wednesday's logs. This will provide a consistent bar for historical graphs, whereas parsing everyday would be unnecessarily expensive and not be that useful for graphing purposes except possibly a one-time view showing the usual trend over weekdays.

In other news, the latest version of the script parsed the same entire day of logs in 10920.178269148 seconds (3 hours), even with writing after every log. I think turning verbose mode off definitely helped.

Update pings counted: 44658395
Update pings skipped: 7348336
A couple minor changes:
- Removed the timezone from the strtotime calculation which was causing times after 11pm to be counted for the next day. I suspect this bug is currently affecting download counts, but it's not a big deal and will probably be fixed by today's clock changes.
- When working on stats graphs for the dashboard, I noticed a few empty records being counted, so I added a bunch of !empty() checks in various places.

There are some differences between the numbers we're getting from this script and the numbers from cbeard's little perl script, so I'm investigating the differences there. I suspect that either his script has some problems (the regex I'm seeing is pretty questionable) or there are some minor fluctuations in the usage. The most popular add-on has a difference of about 400,000.

I'm currently thinking that after this is reviewed and we get it setup in production, I'll see how hard it is for us to pull logs from backups and try to get some update ping history so that we can launch the new stats pages (bug 387326) with the UI revamp with lots of data.
Attachment #286919 - Attachment is obsolete: true
Attachment #287340 - Flags: review?(morgamic)
Attachment #286919 - Flags: review?(morgamic)
Comment on attachment 287340 [details] [diff] [review]
v4, minor changes

Works great on logs and at about 3k entries a day it should be managable.  We will have to watch the table size and break it out into another db if needed.  I'm fine w/ once a week as we talked about.  Overall, this is awesome and thanks for working on this, Justin!
Attachment #287340 - Flags: review?(morgamic) → review+
Checked in.

When the site is updated next (probably Tuesday) we can run the SQL and turn this on.

Add-on developers shouldn't expect to see the results of this right away - I'm going to be working on parsing previous log files to have some historical data for the new statistics page that will hopefully be going live with AMO 3.2 next month.

Leaving this bug open until the crons are updated next week.
Keywords: push-needed
This script is live now, but we had to make a temporary patch to fix the production server not recognizing the usage of $_GET for some reason. Will post a patch next week with a more permanent fix.
In addition to addressing comment #23, we're getting a bunch of

PHP Notice:  MySQL Error 1062: Duplicate entry 'access_2007-11-15-23.0.gz' for key 2 in /mnt/netapp/nfs/stats/amo_stats/bin/database.class.php on line 83

This is happening because mysql_num_rows() isn't working since we switch to the database class. Fortunately MySQL isn't letting this insert go through, so it's not an emergency to fix.
Attached patch fixes, v1Splinter Review
This fixes some notices/warnings we've been getting.
Attachment #289383 - Flags: review?(morgamic)
Comment on attachment 289383 [details] [diff] [review]
fixes, v1

Cancelling review - found a big issue that will be fixed in bug 404079, and I'll just roll these changes into that fix.
Attachment #289383 - Flags: review?(morgamic)
Considering this fixed.

Other issues should be filed as separate bugs.

The results of the script will be made accessible when the Statistics Dashboard launches (and probably on a preview site before that). Watch the webdev blog (blog.mozilla.com/webdev) for details.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Keywords: push-needed
Resolution: --- → FIXED
Product: addons.mozilla.org → addons.mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: