Closed Bug 428287 Opened 16 years ago Closed 15 years ago

Create an Alexa top site report

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: ryansnyder)

References

()

Details

Attachments

(1 file, 1 obsolete file)

We should have reports that track the # of crashes at Alexa top sites. Alexa offers free RSS feeds, but only for the top 25 in each category. They have a pay-per request web service for other stuff, that works out to $0.25/100 URLs.

http://www.alexa.com/site/devcorner/rss_feeds
http://aws.amazon.com/alexatopsites
Whiteboard: admin-panel
would this report just be a re-sorting of http://crash-stats.mozilla.com/topcrasher/byurl/Firefox/3.5 using the alexa order instead of frequency order?
Target Milestone: --- → Future
I assume this report would be public?
Assignee: nobody → ryan
Summary: Alexa top site reports → Create an Alexa top site report
Whiteboard: admin-panel
Target Milestone: Future → 1.2
Looks like Alexa has disabled the free RSS feeds. An AWS account is needed to access the top sites calls at http://aws.amazon.com/alexatopsites

Is there an existing AWS username/password that I should use to get this rolling?
tomcat has an alexa list, but I'm not sure it can be public.
(In reply to comment #4)
> tomcat has an alexa list, but I'm not sure it can be public.

its public, its the 1 million top site report http://s3.amazonaws.com/alexa-static/top-1m.csv.zip (updated daily they say :)
Yeah, the CSV file is available.  It's 9.5mb zipped and 21.x mb unzipped.  Grabbing that entire list of 1m records seems a little heavy for only needing the first 100 records.  But it could be done.
Grabbing once per day in a cron and then processing it into the format we want shouldn't be too bad. We should write the cron part in Python.

I'd say manually process the file and then document what you want the cron to do. (Unless you also want to write that part).
the list that topcat provide might be free, but we should check the terms of use to make sure our intended publishing and use of the data complies.

more thoughts on this.

the churn in that list is likely to be pretty small.  if we grabbed the list only once a week (or even once per month) seems like we wouldn't miss that much.  mostly changes in the ranking, but not new addtions. the ranking is not the key here, its mostly about spotting sites that have problems and getting more people engaged in diagnosing what the problem is both on the Firefox side and the website development team side.

this is another top list that might be interesting to expand out to 200 or 300 if we can handle the processing burden.   

for me, a list of week-old top 300-500 sites with crash rate info about each of the sites that updates every day would be more valuable than a newly updated top 100 list.   the former tries to engage more web developers to check out http://crash-stats.mozilla.com/ for information about how stable their site is for their firefox users.    that starts a cool virtuous circle of improvement, and investigations.   attacking a bigger pool of web developer would help out where red flags are going up on firefox interaction with top sites.

it would be interesting to figure out if we could get site specific crash bugs hooked into this report somehow as well.  just thinking out loud with the mock up below:


rank:

1.  facebook.com     #crash/day  crashratechg%  signatures...  bugs..........
2.  mail.google.com  #crash/day  crashratechg%  signatures...  bugs..........

I'm not sure how we populate the bugs part.  maybe if we find a url in the bug report flagged with the crash keyword or the crash signature titles that lars pulls from bugzilla.   the bugs part is hard, but a link to all the signatures that are associated with the domain of the crashing site is easier.
I've done some tinkering with the Alexa top sites .csv., and created a script that would download the file and basically truncate the file down to only contain the top 1000 sites.  It's currently truncated to 20k in size, and is opened and parsed every time it's used.

I've mocked up adding the top site ranking to the Top Crashers by Domain report.  
http://rsnyder.khan.mozilla.org/reporter/topcrasher/bydomain/Firefox/3.5.4

I've also mocked up a quick report ordered by Top Site Ranking:
http://rsnyder.khan.mozilla.org/reporter/topcrasher/bytopsite/Firefox/3.5.4

Let me know your thoughts about these.

Only the root domain is contained within the .csv file, so it's difficult to distinguish the site rank difference between google.com and mail.google.com, or facebook.com and apps.facebook.com.

I'm wondering if we should add the topsite rank as field in the urldims table, or to create a separate table altogether that we can join to.  The first mockup is a pretty lightweight implementation, and doesn't require any extra database interaction. But the latter request for a page of bugs ordered by site rank (and other future queries that may expand on this) might be best served via the database.

@choffman - I like your idea of providing awareness to external developers.  Allowing external developers to drill into their particular domain, even if their domain is not a top-crasher or a top 1000 site, is a great idea, and hopefully would encourage greater awareness to external developers on how their site is affecting the health of their users' browsers.  Is there a separate ticket open for this?
1: urldims has both domain and specific url. Alexa data seems not to have specific url. 
2: urldims changes only by growing, never by mutating rows.

Both points make a join table a better option than adding a column to urldims. For example:
  create table alexa_rank(
     domain text primary key,
     rank int default 10000,
     last_updated timestamp without time zone not null
  );
The timestamp is not strictly needed, but sometimes it is very nice to be able to see your footprints on changes. Default value for rank is in lieu of requiring it to be not null
Attached patch Patch 1 for #428287 (obsolete) — Splinter Review
Attached is the first patch for #428287.

This depends on the addition of 1 table - `alexa_topsites`.

This patch relies on a decent amount of post-query processing because of the lack of a logical join between alexa_topsites and urldims.  

A better solution may be to add a 2nd table titled `urldims_topsites` that will contain the join information between the 2 tables, but I don't have a firm of understanding of how / how often the urldims table is updated. 

This can be seen in action at:
http://rsnyder.khan.mozilla.org/reporter/topcrasher/bytopsite/Firefox/3.5.5

Top 1000 site information has also been added to the top crasher byurl and bydomain pages:
http://rsnyder.khan.mozilla.org/reporter/topcrasher/bydomain/Firefox/3.5.5
http://rsnyder.khan.mozilla.org/reporter/topcrasher/byurl/Firefox/3.5.5
Attachment #415319 - Flags: review?(ozten.bugs)
Attachment #415319 - Flags: review?(griswolf)
Attachment #415319 - Flags: review?(lars)
Attachment #415319 - Flags: review?(griswolf)
Attachment #415319 - Flags: review-
Comment on attachment 415319 [details] [diff] [review]
Patch 1 for #428287

topsite.php
Lines 52-54: db->escape is working too hard. Better to let db tool do parameter substitution e.g line 189 of this file

lines 67++ comment of function addTopsiteRankingToResults: Do not describe funtion

line 122++ This is scaring me. Comment is wrong. Code is probably much better
as a call to implode(). What is the point of the line breaks when any white
space works? And the line breaks are incorrect for the server/linux platform. 
The main concern is the huge list of OR'd statements each with a LIKE
component. Would really like to re-think this strategy down where the result
is used. 

Line 154: Suspect that you are actually reading the file two times: First
time to get size, second to read it. I would prefer this, which avoids the
issue (and handles the first explode for you)
(http://us3.php.net/manual/en/function.file.php)
$results = file($file);
...
foreach($results as $result) {
  $values = explode(",",rtrim($result,'\r\n'))
  $key = $value[0];
   ...

lines 198++ This is the actual scary part, but hidden by the call to
formatTopsiteWhereStatement. I'll ask Lars to look into it.

Getting loopy: Late. More later
Attachment #415319 - Flags: review?(ozten.bugs) → review-
Comment on attachment 415319 [details] [diff] [review]
Patch 1 for #428287

1) The table population should be a cron written in python. There shouldn't be any data in webapp-php/application/cache

I think it's too late to get this part written by lars or frank, but we should do this in 1.3

2) Adding alexa info should happen in the db layer.
Something like
http://pastebin.mozilla.org/687426

There should be no file access to alexatopsites.csv.

There should be little or no data merging, this should be in the db join.

3) This is up to lars/frank but I'd maintain a mapping table instead of doing the string concatenation. The alexa table would have an id and the mapping table would map urldims to alexa_topsites

4) Remove  getProductId it is unused.
Lots of refactoring here.  Removed the Topsite_Model.  Added JOIN clauses to the Top Crashers by Domain and by URL queries.

These changes will require an extra cron script to pull in the Alexa topsites .csv file, as all the php logic was removed during the refactor.
Attachment #415319 - Attachment is obsolete: true
Attachment #415505 - Flags: review?(ozten.bugs)
Attachment #415319 - Flags: review?(lars)
Comment on attachment 415505 [details] [diff] [review]
Patch 2 for #428287

This looks great. Lots of crazy SQL action, thanks for pushing that into today's patch.
Attachment #415505 - Flags: review?(ozten.bugs) → review+
Working on a separate cron script to import the alexa top sites on a weekly basis.  Committing topsites work.

==

Sending        application/controllers/topcrasher.php
Sending        application/models/topcrashersbyurl.php
Sending        application/views/topcrasher/bydomain.php
Adding         application/views/topcrasher/bytopsite.php
Sending        application/views/topcrasher/byurl.php
Sending        css/screen.css
Sending        js/socorro/topcrashbyurl.js
Transmitting file data .......
Committed revision r1533.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Forgot to add layout.php.

==

Sending        application/views/layout.php
Transmitting file data .
Committed revision r1534.
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: