Closed Bug 600473 Opened 14 years ago Closed 12 years ago

top_crashes_by_signature table does not match reports table

Categories

(Socorro :: Webapp, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: alqahira, Assigned: rhelmer)

References

()

Details

Attachments

(1 file, 1 obsolete file)

The Camino 2.1a1pre topcrash report currently contains 100% of 11 crash incidents in 9 signatures for a reporting period of 2 weeks ("2010-09-14 21:00:00 through 2010-09-28 21:00:00").

However, I crashed twice in the last week in [@ nsJSArgArray::cycleCollection::UnmarkPurple(nsISupports*) ], so at the very least these crashes should be showing up in the report.

bp-2596014f-dad5-46e0-8706-be82e2100927	9/27/10	11:33 PM
bp-a1f276fa-673a-4637-865f-30e4e2100924	9/24/10	4:18 PM

However, these crashes are nowhere to be seen.

Doing an Advanced Query for 2 weeks prior to today for 2.1a1pre returns 57 signatures and 70 crash incidents: http://crash-stats.mozilla.com/query/query?product=Camino&version=Camino%3A2.1a1pre&platform=mac&range_value=2&range_unit=weeks&date=09%2F29%2F2010+00%3A01%3A49&query_search=signature&query_type=exact&query=&build_id=&process_type=any&hang_type=any&do_query=1

That does show [@ nsJSArgArray::cycleCollection::UnmarkPurple(nsISupports*) ] as the most frequent crash, and includes my 2 crashes in the count of 3.

These shouldn't be so vastly out-of-sync.
Rob, can you take a look at this please?
Assignee: nobody → robert
Status: NEW → ASSIGNED
I have confirmed that the dates look correct in the admin/branches screen for Camino 2.1a1pre, and I can reproduce what Smokey is seeing in comment #0. Lars suspects the topCrashesBySignature cronjob, so I've been investigating that. No good theories yet.

Aravind sent me the production log for startTopCrashesBySignature.py, is does look unusual lately:

2010-09-24 05:02:28,489 WARNING - MainThread - No data for interval Start: 2010-09-24 02:00:00, size=1:00:00
2010-09-24 06:02:30,415 WARNING - MainThread - No data for interval Start: 2010-09-24 02:00:00, size=1:00:00
2010-09-24 06:02:30,426 WARNING - MainThread - No data for interval Start: 2010-09-24 03:00:00, size=1:00:00
2010-09-27 13:02:28,285 WARNING - MainThread - No data for interval Start: 2010-09-27 10:00:00, size=1:00:00
2010-09-27 14:02:29,574 WARNING - MainThread - No data for interval Start: 2010-09-27 10:00:00, size=1:00:00
2010-09-27 14:02:29,585 WARNING - MainThread - No data for interval Start: 2010-09-27 11:00:00, size=1:00:00
2010-09-27 15:02:32,050 WARNING - MainThread - No data for interval Start: 2010-09-27 10:00:00, size=1:00:00
2010-09-27 15:02:32,062 WARNING - MainThread - No data for interval Start: 2010-09-27 11:00:00, size=1:00:00
2010-09-27 15:02:32,079 WARNING - MainThread - No data for interval Start: 2010-09-27 12:00:00, size=1:00:00
2010-09-27 16:03:31,967 WARNING - MainThread - No data for interval Start: 2010-09-27 13:00:00, size=1:00:00
2010-09-27 17:02:37,549 WARNING - MainThread - No data for interval Start: 2010-09-27 13:00:00, size=1:00:00
The last time there were messages like this was back in 2010-06-22, and it looks pretty rare before that (nothing since at least 2010-05-26, when this log begins). Unfortunately, it looks like DEBUG and INFO messages are either turned off or go somewhere else (maybe syslog?) since 2010-06-10, so I don't see what successful runs have looked like lately.

I'm going to go over the code some more and take a look at the staging database (which has a recent production dump) to see where this is falling down.
Rob - It looks like the start date for that version is off.  The start date is 2009-09-15; but considering the end date is 2010-10-15, the start date should probably be 2010-09-15.
(In reply to comment #3)
> Rob - It looks like the start date for that version is off.  The start date is
> 2009-09-15; but considering the end date is 2010-10-15, the start date should
> probably be 2010-09-15.

Ack thanks for catching that, I just zeroed in on the "09-15" part apparently.

Smokey, can you fix this and see if it helps (or let me know if it's ok for me to do it ?) Here's the entry in /admin/branch_data_sources as of right now:

Product 	Version 	Branch 	Release 	Start Date 	End Date
Camino 	2.1a1pre 	1.9.2 	development 	2009-09-15 	2010-10-15
(In reply to comment #3)
> Rob - It looks like the start date for that version is off.  The start date is
> 2009-09-15; but considering the end date is 2010-10-15, the start date should
> probably be 2010-09-15.

No, 2009 really is the right start date; we've been on 2.1a1pre since we branched before the final run to 2.0 last year :(  I keep moving the end date for 2.1a1pre back every couple of months when it gets close to expiration.

I've made the change to set the start date to 2010-09-15, though.

(In case that change doesn't help things, I'll also add that I've noticed our crash incidents for 2.0.4 [as reported by the default 14 day topcrasher report] have been down about 6% compared to 2.0.3. At first I wrote it off to the 1.8 data issues that roughly coincided with our 2.0.4 release, but the incident numbers have been pretty steady as time progressed, and based on the numbers of crashes in the signatures whose bugs we fixed between 2.0.3 and 2.0.4, I don't think we can assign all of that 6% reduction to fixes.)
It looks better now, but still seems lower than it should I think... I don't think this is just due to lag since the job that populates the topcrasher reports runs once per hour. For example:

http://crash-stats.mozilla.com/products/Camino

The numbers are about half that of:
http://crash-stats.mozilla.com/topcrasher/byversion/Camino/2.0.4

Which are just a little lower than:
http://crash-stats.mozilla.com/query/query?product=Camino&version=ALL%3AALL&range_value=2&range_unit=weeks&date=10%2F04%2F2010+15%3A11%3A58&query_search=signature&query_type=exact&query=&build_id=&process_type=any&hang_type=any&do_query=1

Am I using this correctly? I would think that the front-page reports would line up with the topcrasher/byversion/ page, and the advanced query would tend to be a little higher than that (since there's the hour lag for the cronjob to run).
(In reply to comment #6)
> It looks better now, but still seems lower than it should I think... I don't
> think this is just due to lag since the job that populates the topcrasher
> reports runs once per hour. For example:
> 
> http://crash-stats.mozilla.com/products/Camino
> 
> The numbers are about half that of:
> http://crash-stats.mozilla.com/topcrasher/byversion/Camino/2.0.4

Actually it all lines up exactly if we use the 7-day option:
http://crash-stats.mozilla.com/topcrasher/byversion/Camino/2.0.4/7

I could believe that this is 1 hour ahead:
http://crash-stats.mozilla.com/query/query?product=Camino&version=ALL%3AALL&range_value=1&range_unit=weeks&date=10%2F04%2F2010+15%3A11%3A58&query_search=signature&query_type=exact&query=&build_id=&process_type=any&hang_type=any&do_query=1

Maybe the "View All" link from the front page should link to the 7-day report, since that's the default on the front page (I notice now that I am paying attention)?
Attachment #480770 - Flags: review?(ryan)
(In reply to comment #9)
> http://crash-stats.mozilla.com/topcrasher/byversion/Camino/2.1a1pre (aka
> http://crash-stats.mozilla.com/topcrasher/byversion/Camino/2.1a1pre/14)
> and
> http://crash-stats.mozilla.com/query/query?product=Camino&version=Camino%3A2.1a1pre&platform=mac&range_value=2&range_unit=weeks&date=2010-10-04+21%3A00%3A00&query_search=signature&query_type=exact&query=&build_id=&process_type=any&hang_type=any&do_query=1
> 
> are still showing up wildly out-of-sync, even after the change in comment 4/5.
> 
> If I dial that back to 7 days topcrasher/1 week query-by-date, things are
> "closer" (12 crashes in the topcrash report, 38 in the query-by-date), but
> still too far off to be time-of-report-generation errors :(

Yes I agree, that doesn't look right. Not sure if 2.0.4 looks better because there is really a difference, or just more data obscuring the problem.

I'll dig into this further, at the very least I can concoct a test for development/staging to make sure it's working the way we intend, and look in production for any evidence of problems (we may want to turn the logging up, re: comment 2).
I think there are a couple issues here - and they should probably be tackled as separate issues.

First, the default value for all topcrasher searches is 14 days, yet the default value for ADU, topcrasher and topchanger searches on each product (and version) dashboard is 7 days.

Second, the topcrasher and report searches are a bit off, and it's not just with Camino.  Take Firefox 4.0b7pre for example:

http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/4.0b7pre/14

http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A4.0b7pre&platform=windows&platform=mac&platform=linux&platform=solaris&range_value=2&range_unit=weeks&date=2010-10-05+13%3A20%3A00&query_search=signature&query_type=exact&query=&build_id=&process_type=any&hang_type=any&do_query=1

These 2 pages are querying different data sets.  The topcrasher query is an api call to the middleware layer, which queries against the top_crashes_by_signature table.  The reports query is a db call made by the UI layer to the reports table in Common_Model::common.queryReports().  From a quick look, it seems like there is a discrepancy between these 2 data sets.
Comment on attachment 480770 [details] [diff] [review]
link to same duration that dashboard is set to

Split out to bug 602054
Attachment #480770 - Flags: review?(ryan)
(In reply to comment #11)
> I think there are a couple issues here - and they should probably be tackled as
> separate issues.
> 
> First, the default value for all topcrasher searches is 14 days, yet the
> default value for ADU, topcrasher and topchanger searches on each product (and
> version) dashboard is 7 days.
> 
> Second, the topcrasher and report searches are a bit off, and it's not just
> with Camino.  Take Firefox 4.0b7pre for example:
> 
> http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/4.0b7pre/14
> 
> http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A4.0b7pre&platform=windows&platform=mac&platform=linux&platform=solaris&range_value=2&range_unit=weeks&date=2010-10-05+13%3A20%3A00&query_search=signature&query_type=exact&query=&build_id=&process_type=any&hang_type=any&do_query=1
> 
> These 2 pages are querying different data sets.  The topcrasher query is an api
> call to the middleware layer, which queries against the
> top_crashes_by_signature table.  The reports query is a db call made by the UI
> layer to the reports table in Common_Model::common.queryReports().  From a
> quick look, it seems like there is a discrepancy between these 2 data sets.

Yes I agree, I am going to change the summary of this one slightly. We're going to change the default for the dashboard to 14 days in bug 602054 (and not pass the duration around as my now-obsolete patch did).
Summary: Camino 2.1a1pre topcrash report missing many crashes → top_crashes_by_signature table does not match reports table
Attachment #480770 - Attachment is obsolete: true
Priority: -- → P2
Priority: P2 → P1
Now that the unittest for this is working consistently locally and via Hudson, I am going to try to expose this bug in the test, and then fix it.
Depends on: 606332
(In reply to comment #14)
> Now that the unittest for this is working consistently locally and via Hudson,
> I am going to try to expose this bug in the test, and then fix it.

I haven't had any luck flushing out the bug yet. I am going to explore the data a bit, and see if that yields anything.

Since I see this on staging, I decided to grab a snapshot of what I think should be two equivalent data sets. 
I would expect the counts below to be equivalent but they are not:

=> select sum(count) from top_crashes_by_signature where window_end >= '2010-10-20 14:00:00' and window_end <= '2010-11-03 14:00:00' and signature = 'hang | KiFastSystemCallRet' and productdims_id = '115';
------
 2468
(1 row)

=> select count(*) from reports where signature = 'hang | KiFastSystemCallRet' and completed_datetime >= '2010-10-20 14:00:00' and completed_datetime <= '2010-11-03 14:00:00' and version = '3.6.11';
-------
  2475
(1 row)


If anyone sees a problem with the above queries, please let me know! There are a lot of moving parts here so I am trying to distill this down to the behavior that we actually want, as opposed to the queries that different parts of the system are using right now.

I went ahead and saved these into CSV files for further analysis:


=> \copy (select * from top_crashes_by_signature where window_end >= '2010-10-20 14:00:00' and window_end <= '2010-11-03 14:00:00' and signature = 'hang | KiFastSystemCallRet' and productdims_id = '115') to '/home/rhelmer/bug600473-tcbs.csv'

=> \copy (select * from reports where signature = 'hang | KiFastSystemCallRet' and completed_datetime >= '2010-10-20 14:00:00' and completed_datetime <= '2010-11-03 14:00:00' and version = '3.6.11') to '/home/rhelmer/bug600473-reports.csv'


Next steps are to determine which window(s) in tcbs.csv have the wrong count, and see what is different about the corresponding rows in reports.csv
(In reply to comment #15)
> => select count(*) from reports where signature = 'hang | KiFastSystemCallRet'
> and completed_datetime >= '2010-10-20 14:00:00' and completed_datetime <=
> '2010-11-03 14:00:00' and version = '3.6.11';
> -------
>   2475
> (1 row)
> 
> 
> If anyone sees a problem with the above queries, please let me know! There are
> a lot of moving parts here so I am trying to distill this down to the behavior
> that we actually want, as opposed to the queries that different parts of the
> system are using right now.

Just to answer my own question, I was getting unexpected results with the above. Looks like the UI is really doing something more like:

=> select count(*) from reports where signature = 'hang | KiFastSystemCallRet' and date_processed >= '2010-10-20 14:00:00' and date_processed <= '2010-11-03 14:00:00' and version = '3.6.11';
 count 
-------
  2470
(1 row)

I'll redo the CSV dumps with this in mind.
One thing that seems to correlate well for TCBS windows with missing reports is started_datetime and completed_datetime being in a different hour than date_processed.

From reading socorro/processor/processor.py, I think that:

1) date_processed comes from metadata (jsonDocument["submitted_timestamp"])
2) {started,completed}_datetime are set using datetime.datetime.now() by the processor, as processing starts and completes

It looks like the modpython-collector inserts #1 at collection time:
http://code.google.com/p/socorro/source/browse/tags/releases/1.7.4_r2552_20101005/socorro/collector/modpython-collector.py#49

Per bug 608190 comment 2, scripts/startTopCrashesBySignature runs once per hour, at two minutes past the hour ("cron_aggregates.sh"). From reading socorro/cron/topCrashesBySignature, I think that it only considers "date_processed":

http://code.google.com/p/socorro/source/browse/tags/releases/1.7.4_r2552_20101005/socorro/cron/topCrashesBySignature.py#111

I had been leaning towards a theory that Lars proposed before, that late-processed crashes were not getting picked up by the TCBS cron, but the SQL query in the previous link makes me unsure since it looks like it's taking into account product_visibility (in the sample I took, it'd be looking for crashes between '2010-10-05 00:00:00' and '2011-01-03 00:00:00', so it should effectively backfill any late-processed crashes).

However there's a lot going on in topCrashesBySignature.py, I'd like to see what the SQL queries look like when all the variables are bound. Looks like hopefully all of what I need is already logged, but only when debug is enabled (we have logging level set to debug on staging, but this is the actual "debug" setting). 

Next steps I am taking are:

* turn on debug for staging, so tCBS.py will log queries
* check my sample set to see if there are any tcbs windows with no missing reports, yet {started,completed}_datetime is in a different hour than date_processed

I'm using a script to do analysis on the samples I pulled in comment 15/comment 16, I'll post that once the above check is added.
(In reply to comment #17)
> http://code.google.com/p/socorro/source/browse/tags/releases/1.7.4_r2552_20101005/socorro/cron/topCrashesBySignature.py#111
> 
> I had been leaning towards a theory that Lars proposed before, that
> late-processed crashes were not getting picked up by the TCBS cron, but the SQL
> query in the previous link makes me unsure since it looks like it's taking into
> account product_visibility (in the sample I took, it'd be looking for crashes
> between '2010-10-05 00:00:00' and '2011-01-03 00:00:00', so it should
> effectively backfill any late-processed crashes).

After looking at this more closely, this last paragraph does not make sense. The window being considered is startWindow to startWindow+deltaWindow:
http://code.google.com/p/socorro/source/browse/tags/releases/1.7.4_r2552_20101005/socorro/cron/topCrashesBySignature.py#203

The product_visibility join is to only include products that are not ignored, and that are currently active.

I'll confirm this when debugging is enabled on staging, but this better fits my previous understanding, and is making the theory that late-processed crashes are at least partly responsible more plausible.
Quick hack to show the difference between the tcbs and reports tables CSVs. I used a single signature, and single release (3.6.11) of Firefox (to make sure we're not seeing effects of nightly-specific logic, which may exclude based on buildID).

I've cleaned up the output here to make it more human readable, and annotated with comments:

# I suspect that tcbs and reports table disagree about which window should be
# used for one report in this range. Ignored this one, for now, since it
# will balance out eventually.
tcbs date: 2010-10-24 18:00:00 tcbs: 20 actual: 21
tcbs date: 2010-10-24 19:00:00 tcbs: 22 actual: 21

# this looks like a real problem. I've looked at the "actual" report rows for
# this window, and don't see any reason they should be ignored.
tcbs date: 2010-10-22 14:00:00 tcbs: 7 actual: 16
# if we long-running jobs with 
# "INFO: This record is a replacement" processor notes, then we find:
8 matching long-running reports # (took over 3 hours to process)

# This leaves 1 unaccounted for; it could have been a replacement for 
# an originally long-running process. I don't think there's any way to tell
# the original running time of a process when this happens.

# below matches the UI output at the time the data was saved
total tcbs: 2328
total reports: 2337

For completeness, here's how I dumped the data. Note that the hour is offset -1 for the reports table, because the tcbs date marks the end of the window:

=> \copy (select * from top_crashes_by_signature where window_end >= '2010-10-20 19:00:00' and window_end < '2010-11-03 19:00:00' and signature = 'hang | KiFastSystemCallRet' and productdims_id = '115') to '/home/rhelmer/bug600473-tcbs.csv'
=> \copy (select * from reports where signature = 'hang | KiFastSystemCallRet' and date_processed >= '2010-10-20 18:00:00' and date_processed < '2010-11-03 18:00:00' and version = '3.6.11') to '/home/rhelmer/bug600473-reports.csv'
(In reply to comment #19)
> # if we long-running jobs with 

Should say: "if we *exclude* long-running jobs with"

> # "INFO: This record is a replacement" processor notes, then we find:
> 8 matching long-running reports # (took over 3 hours to process)

As mentioned elsewhere in comment 19, I don't think there's any way to tell the original running time of a process when this replacement has happened.
Blocks: 632810
Severity: major → normal
Priority: P1 → --
Priority: -- → P3
Not working on this right now, I think consensus is to move away from TCBS altogether.
Assignee: rhelmer → nobody
Status: ASSIGNED → NEW
Component: Socorro → General
Product: Webtools → Socorro
Component: General → Webapp
rhelmer, is there still an issue there for newtcbs? If not, we probably should just close out this bug.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #23)
> rhelmer, is there still an issue there for newtcbs? If not, we probably
> should just close out this bug.

doubt it, will double check.. thanks for digging this up.
Assignee: nobody → rhelmer
We've replaced the old TCBS impl.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: