Closed Bug 560628 Opened 14 years ago Closed 14 years ago

ADU daily UI changes for hang detection

Categories

(Socorro :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ozten, Assigned: ozten)

References

Details

Attachments

(1 file, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #560627 +++

Per Bug#559360 Comment#16 We need the ADU daily report to support hang detection.
This depends on an update to 
http://code.google.com/p/socorro/wiki/AduAPI
No longer depends on: 544940, 559360
Summary: ADU daily UI changes for hang detection → ADU daily service call changes for hang detection
Target Milestone: --- → 1.7
Summary: ADU daily service call changes for hang detection → ADU daily UI changes for hang detection
Version: 1.8 → 1.7
Assignee: nobody → chowse
Priority: -- → P1
See attachment #439544 [details] for mock-ups. The All/Crashes/Hangs options above the chart can be excluded.
looks good.
Assignee: chowse → ozten.bugs
We need to add report_type to the top_crashes_by_signature table.
report type can be any, crash, or hang.

There are a couple of ways to go here.
1) Add report_type column and calculate twice as many rows. The crashesByADU API would filter based on this column.

2) Create a table that maps prod/version/signatures to report_type. This can be updated via top crashers or another cron. The crashesByADU API would filter results by joining against this new table.
2.1) A signature would be a 'hang' if x% were for crashes with a hangid.

3) ??? Will noodle on this some more.
here is how I'm tracking the data until socorro gets updated

https://wiki.mozilla.org/CrashKill/Crashr#3.6.4
we could actually create one more calculated number column.  that would be to divide the number of "hang reports" by 2 to have a "number of hang incidences."

the number of "hang incidences" plus the number of crashes equals our new over all stability number.
(In reply to comment #4)
(In reply to comment #5)
Created Bug#565482
So I've been working on #2 from Bug#560628 Comment#3. I've hit a wall in getting my new queries to perform well do to the magnitude of data in the tcbs table. I've create Bug#565777 and re-opened Bug#545000.
Target Milestone: 1.7 → 1.8
This needs more testing, but I think it's ready for code review.

I tried really hard to work with existing tables, but eventually adding a new table that is at the same level of granularity as raw_adu and the daily.php report turned out to be a much easier path.

See comments in schema.py for db column documentation.
Attachment #447117 - Flags: review?(ryan)
Attachment #447117 - Flags: review?(laura)
Attachment #447117 - Flags: review?(lars)
Comment on attachment 447117 [details] [diff] [review]
Adds new daily_crashes db table as well as new startDailyCrashes.py cron

In models/daily.php::get(), could we add the new parameter at the end for backwards API compat?  I know it's not *that* relevant in this instance.

I also have a couple, possibly dumb, questions about the python parts of this patch.  I'll ping you in IRC.
Comment on attachment 447117 [details] [diff] [review]
Adds new daily_crashes db table as well as new startDailyCrashes.py cron

These issues are all relatively minor.  If the code works, and you don't want to change it, I'll switch to +.  


== TestSchema.py ==
ok


== daily_crash.py ==
several issues:

1 - the socorro.lib.psycopghelper module is deprecated.  You can use the socorro.database.database module that has nearly an identical interface.
  
2 - function "most_recent_day" - This function has some problems handling exceptions. Both exception handlers catch all exceptions including keyboardInterupt and SigKill.  If some thing goes wrong and the process hangs and must be killed, these exception handlers will necessitate "kill -9".  It is better to use this idiom:

 try:
   ...
 except Exception:
   ...

The python exception hierarchy places signals and the keyboard interrupt exceptions at a lower level than the class Exception.  This idiom will not catch those and allow the natural process killing mechanisms to work.

3 - function "most_recent_day" - the outer exception hander does not need to encompass the inner exception handler.  That inner one will catch everything.  The outer one could be shrunk to cover only lines 18-20.  The result is slightly different behavior that may be a correction to an unintended code path.

In the existing code, if the singleValueSql fails with some exception, then the outer handler falls back to a returning the value of "most_recent_day_from_product_visibility" without checking it for None like the other code path does when it calls that function.

This seems cleaner by using only 2 exception handlers instead of 3 and has no nesting:

  try:
    day = psy.singleValueSql(databaseCursor, "SELECT MAX(adu_day) FROM daily_crashes")
    if day:
      return day
    logger.info("daily_crashes was empty, using product_visibility to determine start date")
  except Exception:
    util.reportExceptionAndContinue(logger)
    logger.warning("Ouch, db error accessing daily_crashes, using product_visibility to determine start date")      
  try:
    day = most_recent_day_from_product_visibility(databaseCursor, logger)
    if day:
      return day
    else:
      return fail_most_recent_day(logger)
  except Exception:
    util.reportExceptionAndContinue(logger)
    return fail_most_recent_day(logger)

4 - function "most_recent_day_from_product_visibility" - this function has the side effect of rolling back the connection. That rollback should, perhaps, be placed in the exception handler itself to indicate the exception state.

5 - function "record_crash_stats" - 1st logging has a spelling error

6 - function "record_crash_stats" - the socorro.database.database module gives you a slightly simpler interface to the database.  Since this is not a multithreaded program, there is no need to instantiate a whole database connection pool.  A single connection will do:

  databaseConnectionPool = psy.DatabaseConnectionPool(config.databaseHost, config.databaseName, config.databaseUserName, config.databasePassword, logger)

becomes

  databaseConnection = db.Database(config)

and 

    _, databaseCursor = databaseConnectionPool.connectionCursorPair()

becomes

   databaseCursor = db.connection().cursor()

== aduByDay.py ==

I like the way that in "fetchCrashHistory" you've replaced a horrendous date equation with nothing special. It looks like the UTC to PST/PDT conversion has been done in the insert into the daily_crashes table.

== schema.py ==
ok

== startDailyCrash.py ==
ok

== dailycrashconfig.py.dist ==

rather than importing everything from 'commonconfig.py', most of the 1.7 scripts/configs have been converted to cherry picking values.  This makes the --help command line option list out only what is really needed to run the script rather than everything.  For an example, see the new processorconfig.py.dist.  Also, that idiom of exception handle was masking errors.  We've settled on dropping it and requiring that configuration be in ./config rather than having a fallback location (that was always wrong).
Attachment #447117 - Flags: review?(lars) → review-
sorry, I missed one.  We should at least stub out a .../socorro/unittests/cron/test_daily_crash.py.  Actually writing the damn things takes a couple times longer that writing the code itself.  Ideally we want unit tests for every function, but considering our time constraints, we may have to (rightly or wrongly) consider that to be a luxury...
Attached patch second stabSplinter Review
Attachment #447117 - Attachment is obsolete: true
Attachment #447117 - Flags: review?(ryan)
Attachment #447117 - Flags: review?(laura)
Attachment #447176 - Flags: review?(laura)
Attachment #447176 - Flags: review?(lars)
Second patch has changes from Comment #9 and Comment #10.

Thanks guys for the feedback.

(In reply to comment #11)
I agree, but these should be integration tests though as none of the logic in daily_crash is unit testable.
(In reply to comment #13)
The logic is unit testable using mock objects to replace and simulate the database.  It is extremely tedious and it is brittle.  I really hate writing that kind of test.  However, I cannot deny that on several occasions, the act of writing the tests revealed some subtle logic flaws.  It is also about the only way to test exception handling.
In addition to patch2 one slight tweak per lars IRC feedback:
  database = db.Database(config)
  databaseConnection = database.connection()
  try:
...
  finally:
    databaseConnection.close()
Attachment #447176 - Flags: review?(lars) → review+
Checking in *backend portion first* socorro scripts

Adding         scripts/config/dailycrashconfig.py.dist
Adding         scripts/startDailyCrash.py
Adding         socorro/cron/daily_crash.py
Sending        socorro/database/schema.py
Sending        socorro/lib/psycopghelper.py
Sending        socorro/services/aduByDay.py
Sending        socorro/unittest/database/testSchema.py
Transmitting file data .......
Committed revision 2105.

I'll commit frontend portion once Bug#567923 is done.
Target Milestone: 1.8 → 1.7
Committing frontend changes.

Sending        webapp-php/application/controllers/daily.php
Sending        webapp-php/application/controllers/products.php
Sending        webapp-php/application/models/daily.php
Sending        webapp-php/application/views/daily/daily_search.php
Sending        webapp-php/application/views/daily/index.php
Transmitting file data .....
Committed revision 2110.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment on attachment 447176 [details] [diff] [review]
second stab

I missed this...but it's good.
Attachment #447176 - Flags: review?(laura) → review+
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: