Fix scripts that populate aggregate reporting tables (topcrash is broken)

RESOLVED FIXED in 0.6

Status

task
P1
normal
RESOLVED FIXED
11 years ago
8 years ago

People

(Reporter: justin.gallardo, Assigned: ozten)

Tracking

Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 3 obsolete attachments)

Reporter

Description

11 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13
Build Identifier: Trunk

There needs to be some scripts written that will be executed by cron to populate and update the data in the aggregate reporting data. This allows large queries such as top 100 crashers, and top crashers of all time to be accessed much quicker than the current setup.

Reproducible: Didn't try
Assignee: nobody → justin.gallardo
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P1
Target Milestone: --- → 0.5
Reporter

Updated

11 years ago
Blocks: 426410
Reporter

Comment 1

11 years ago
Posted file First attempt at the script (obsolete) —
I attached my first attempt at a script to populate the summary table. The only thing it leaves out is the trend, as I am not 100% sure how/what we want to store there.
Attachment #313231 - Attachment mime type: text/x-python → text/plain
Reporter

Updated

11 years ago
Status: NEW → ASSIGNED
Comment on attachment 313231 [details]
First attempt at the script

Hey, good so far - comments:
s/aggreate/aggregate/

Would like to see either a separate config file or passed args from command line -- probably separate file.  We might be able to use one of the existing configs -- see http://code.google.com/p/socorro/source/browse/trunk/socorro/lib/config.py.dist -- we could piggyback off that pretty easily.

See a couple of things that worry me as far as scaling.  SELECT * FROM reports and the for loop over that result set will probably not work very well.

So -- consider how can we make this do updates for only date ranges that the script knows are new and I think we solve a lot of that.
Reporter

Comment 3

11 years ago
Posted file Second try.
This version of the script is using the socorro.lib.config configuration file now, and also has gotten rid of the ugly 'SELECT * FROM reports' query. Now, it checks the summary table for its latest update, and then queries reports for all records new since that time.

Thoughts?
Attachment #313231 - Attachment is obsolete: true
Comment on attachment 313238 [details]
Second try.

Can you add these as text/plain?
Attachment #313238 - Attachment mime type: text/x-python → text/plain
Could join the last_updated query and the reports query into one by using a subquery.

Would like better exception handling -- what if data is in an unexpected format?  What if crashes is empty?  That said, looks ok functionally but think of how you would write tests for these methods, and how we should handle any failures in data getting, mutating or setting by using try/except statements.  For example, what would we output to cron in the event of a catastrophic failure?

This looks a little strange:
    if row[16] == "mac" or row[16] == "Mac OS X" or row[16] == "#3C3":

I don't think the value is going to be 'mac' or '#3C3' ever.  First element in the platforms list is the value to key off of, so: row[16] == "Mac OS X"

Comment on attachment 313238 [details]
Second try.

Lars, could you give Justin some feedback?
Attachment #313238 - Flags: review?(lars)
Which socorro is being used, the old one or the refactored new one?  In the new one, socorro.lib.config is deprecated in favor of each script having its config file.  See socorro/monitor/config.py and socorro/processor/config.py.

Since monitor and processor both now allow command line parameters, I think that other scripts based on the socorro code should follow suit.  See .../scripts/startMonitor.py for an example how the class ConfigurationManager coupled with the form of the new style config files makes command line stuff trivial.

line 28: map(itemgetter(1), ranks)  - makes a list of the items drawn from ranks and then does nothing with it.  

line 39: ..."dbname='" + config.db_name + "' user='" + config.db_user + "' host='" + config.db_host + "' password='" + config.db_pass + "'"... is fine PHP, but inefficient in Python because each plus sign results in another complete copy of the string being created.  A more “Pythonic” solution that results in only one string is: “dbname='%s' user='%s' host='%s' password='%s'” % (dbname, user, host, password).  Though, the new configuration files offer the entire databaseDSN as one string...

line 41: since this program runs in cron, logging error messages is more valuable than printing to stdout.  See .../scripts/startMonitor.py for an example of how to set it up.

In lines such as 55: key = (row[3], row[4], row[6]), a comment that indicates which corresponding columns you're shooting for would be helpful to facilitate some random maintainer in the future understand the code more quickly. This is one of those annoyances that hits all programs that use the Python DBAPI.

line 56: the call to .has_key is deprecated in future versions of Python.  The correct way to do it now is: if key not in summary_crashes:

line 75:       summary_crashes[key]['uptime'] = (row[10] + summary_crashes[key]['uptime']) / crash_count[key];
not sure I understand what's happening here.  Is summary_crashes[key]['uptime'] supposed to hold an average?  Say we had four crashes all with an up time of 10.  On processing the first one we'd have a value of 10.  On the processing the second we'd get (10 + 10) / 2.0 = 10.  On processing the third, we'd get (10 + 10) /3.0 = 6.6667  On the fourth we'd get (10 + 6.6667) / 4.0 = 4.16667

lines 88,89: not necessary to return and rebind the dictionary summary_crashes.  The dictionary is effectively passed by reference into the functions.  Modifications of the dictionaries within the functions will persist outside the functions.

lines 95,96 & 104,105:  since these lines execute sql queries within a loop, you want the queries to use positional rather than absolute parameters.  That way the server will plan and compile the queries only once rather than each time the queries are submitted.  Replace all substitution parameters with %s with no quotes regardless of the type.  Move the sql definition outside the loop and then move the parameters tuples into the cur.execute call in this manner:

  sql = "SELECT rank FROM topcrashes WHERE product=%s AND branch=%s AND signature=%s ORDER BY last_updated DESC LIMIT 1"
  cur.execute(sql, (data['product'], data['version'], data['signature']))

Reporter

Comment 8

11 years ago
Lars: Thanks for the input! I was able to improve the script quite a bit using the libs from socorro. Thanks for the pointers!

I just created /socorro/cron in svn, with topcrashes.py and topcrashes_config.py.dist, with the latest version of the script that I have. 

I was able to implement logging, configuration options, command line switches, and added some more error handling into the script.

I added the 'initMode' flag as well. With this flag, it will perform a full select of the reports table, as oppose to grabbing everything since the last update. Useful for doing the first run of data. We may want to make this only grab two weeks of data or the like, just to prevent a massively huge dataset being calculated.
Comment on attachment 313238 [details]
Second try.

Thanks for the comments, Lars.
Attachment #313238 - Flags: review?(lars)
Reporter

Comment 10

11 years ago
Merged into trunk in r433. Still waiting to setup on staging for testing.
Target Milestone: 0.5 → ---
Target Milestone: --- → 0.6
Assignee: justin.gallardo → nobody
This was setup on staging and tested, however we are seeing inconsistency in data.  Next step is to understand why our lists differ from what is seen in a simple query on reports.*
Target Milestone: 0.6 → ---
Summary: Need scripts to populate aggregate reporting tables. → Fix scripts that populate aggregate reporting tables (topcrash is broken)
Assignee: nobody → aking
Target Milestone: --- → 0.6
If we do find out that the database is correct we just need to correct the reporter query.
Assignee

Comment 13

11 years ago
Looking at the data, the aggregate numbers in topcrashes didn't match the individual numbers in reports table. I found two seperate off by 1 errors in code. Both are fixed by eliminating if / else which initialize state.
Attachment #342657 - Flags: review?
Assignee

Updated

11 years ago
Attachment #342657 - Flags: review?(morgamic)
Attachment #342657 - Flags: review?(lars)
Attachment #342657 - Flags: review?
Assignee

Comment 14

11 years ago
Comment on attachment 342657 [details] [diff] [review]
Patch to fix counters for off by 1 errors

Making reviewers specific lars or morgamic.
I could be wrong, but I think you need to patch the PHP file. Since the rewrite from Python to PHP, I think all the web-app Python code is dead.

Maybe this? http://code.google.com/p/socorro/source/browse/trunk/webapp-php/application/controllers/topcrasher.php
There's a Python script run from cron that fills a table for the webapp to use:
http://code.google.com/p/socorro/source/browse/#svn/trunk/socorro/cron
Ah, wow. Well I stand corrected. :)
On reviewing the code:

Your first change on lines 78-81 involves testing to see if row has anything in it before bothering to proceed.  The old code will simply raise an exception that is caught in main.  Your patch, tests and if finding a problem, reports the problem.  However, it then lets the program proceed, where
The next sessas it fails and raises an exception on the next line.  A more 'Pythonic' solution would be like this:

try:
  start_time = row[0]
except IndexError:
  logger.warn("topcrashers table is empty...")
  raise

The next section of your changes looks fine, flattening 'if' hierarchies is always a nice thing to do.  However, one thing looks suspicious: lines 130:132.  In the case where the signature is not in summary_crashes[key], you set crash_count[fullKey] = 1.0.  Immediately thereafter, it is incremented to 2.  Is this correct?
Assignee

Comment 19

11 years ago
Good points Lars.

I have incorporated your two suggestions.

Initalization to 0:
    summary_crashes[key][signature]['users'] = 0
          crash_count[fullKey] = 0

        crash_count[fullKey] += 1

and using raise:

if row:
      start_time = row[0]
    else:
      msg = "topcrashers table is empty, use -I to run in initializing mode"
      logger.warn(msg)
      raise msg
Attachment #342657 - Attachment is obsolete: true
Attachment #342937 - Flags: review?(lars)
Attachment #342657 - Flags: review?(morgamic)
Attachment #342657 - Flags: review?(lars)
actually, raising strings as exceptions is deprecated in Python.  Only instances of classes derived from BaseException can be raised in future versions.  In v2.5, Python will issue a deprecation warning.  In v2.6, I believe it is an error.

To avoid issues in the the future when we inevitably migrate, you can change the code to:

raise Exception(msg)
Assignee

Comment 21

11 years ago
What is a good Exception type?
It looks like TypeError(msg) is the default.

In Java I would use IllegalStateException or something like that... What do you suggest for Python?
It's perfectly fine just to use Exception(msg) as a generic error.  Typically most modules define their own exception hierarchy.  However, in this case that's probably overkill since the calling program doesn't bother to respond to exceptions.  If you were to go that route, however, it would look like this:

class NotInitializedException(Exception):
  def __init__(self):
    super (NotInitializedException, self).__init__("topcrashers table is empty,...")


On continuing the review, I see that you changed the crash_count from a
floating point value to an integer.  I suspect that it was originally set a
floating point value to coerce floating point rather than integer division on
line 25.  You may want to check to see if integer division is acceptable for
that calculation.
Assignee

Comment 23

11 years ago
Using Exception(msg) and reverted to initialing count as a float.
Attachment #342937 - Attachment is obsolete: true
Attachment #342954 - Flags: review?(lars)
Attachment #342937 - Flags: review?(lars)
it looks fine to me -
Assignee

Comment 25

11 years ago
Checked in r626
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Did this get deployed in production? If not, bug 460704 is a dupe of this. If so, then see that bug which says it's still not correct.
Duplicate of this bug: 460704
Removing the push-needed keyword means the fix was pushed to production? The topcrasher list still reports wrong numbers (as what I reported in bug 460704). When the correct ones will be shown?
Assignee

Comment 29

11 years ago
Sorry for the shuffle of bug #s. There is a patch under review attached to 461977, release TBD.

Updated

11 years ago
Depends on: 466549
Assignee

Updated

9 years ago
Attachment #342954 - Flags: review?(lars)
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.