Socorro - TCBS doesn't match totals in comparison with the advanced search

RESOLVED FIXED

Status

RESOLVED FIXED
9 years ago
7 years ago

People

(Reporter: lars, Unassigned)

Tracking

Trunk
x86
Linux

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

9 years ago
As of 2009/09/30 the totals for the signature "cooliris19.dll@0x351f2" between the live search and the Top Crash by Signature trend reports do not match.  Why is that?
Using this as a place to keep notes. The queries are entirely different:
TREND:
/* soc.web topcrash.topcrasherss */ 
SELECT p.product AS product, p.version AS version, tcs.signature, 
       sum(tcs.count) as total, 
       sum(case when o.os_name = 'Windows NT' then tcs.count else 0 end) as win, 
       sum(case when o.os_name = 'Mac OS X' then tcs.count else 0 end) as mac,
       sum(case when o.os_name = 'Linux' then tcs.count else 0 end) as linux 
 FROM top_crashes_by_signature tcs 
 JOIN productdims p ON tcs.productdims_id = p.id 
 JOIN osdims o ON tcs.osdims_id = o.id 
WHERE p.product='Firefox' AND p.version='3.5.3' AND window_end >= '2009-09-15 16:36:01' 
GROUP BY p.product, p.version, tcs.signature 
HAVING sum(tcs.count) > 0 
ORDER BY total desc LIMIT 100

QUERY:
/* soc.web common.queryTopSig. */ 
SELECT reports.signature, count(reports.id), 
       count(CASE WHEN (reports.os_name = 'Windows NT') THEN 1 END) AS is_windows, 
       count(CASE WHEN (reports.os_name = 'Mac OS X') THEN 1 END) AS is_mac, 
       count(CASE WHEN (reports.os_name = 'Linux') THEN 1 END) AS is_linux, 
       count(CASE WHEN (reports.os_name = 'Solaris') THEN 1 END) AS is_solaris 
  FROM reports 
 WHERE reports.signature IS NOT NULL 
    AND (reports.product = 'Firefox')
    AND ((reports.product = 'Firefox' AND reports.version = '3.5.3'))
    AND reports.date_processed BETWEEN TIMESTAMP '2009-10-01 00:07:19'-CAST('1 weeks' AS INTERVAL) AND TIMESTAMP '2009-10-01 00:07:19'
GROUP BY reports.signature 
ORDER BY count(reports.id) DESC LIMIT 100
Wouldn't the CAST() in the queryTopSig section hurt constraint exclusion for partitioning?  Or is that only a but with 8.2?
*bug
That CAST is just an artifact of the language: Need to subtract a week, and that is one way to say it. The result is still a TIMESTAMP.

I'm thinking about why this query is running against reports, rather than top_crashes_by_signature though...
(Reporter)

Comment 5

9 years ago
Even though the queries are different, they should still come up with the same results when asked for the same data.  

This query shows over 14K crashes for the most recent two weeks for the specific signature 'cooliris19.dll@0x351f2' for FF 3.5.3:

http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A3.5.3&date=&range_value=2&range_unit=weeks&query_search=signature&query_type=exact&query=cooliris19.dll%400x351f2&do_query=1

Over on TCBS for the same time period and same product/version, the signature 'cooliris19.dll@0x351f2' shows as #3 with a total of only 3K.  

The TCBS script originally gets its information from the reports table and aggregates it.  The UI then does further aggregation by combining the script's aggregated buckets for the requested time period.  Either the script or the UI is under counting.
(Reporter)

Comment 6

9 years ago
here's some more evidence that the TCBS script is under counting.  

First, just looking at one 12 minute time slice from the period ending at 3pm on 2009-09-27 for the signature 'cooliris19.dll@0x351f2' and FF3.5.3

breakpad=> select
breakpad->   p.count,
breakpad->   p.product,
breakpad->   p.version,
breakpad->   o.os_name,
breakpad->   o.os_version
breakpad-> from 
breakpad->   top_crashes_by_signature t join osdims o on t.osdims_id = o.id
breakpad->     join productdims p on t.productdims_id = p.id
breakpad-> where 
breakpad->   signature = 'cooliris19.dll@0x351f2'
breakpad->   and productdims_id = 46
breakpad->   and window_end = '2009-09-27 15:00:00';
 count | product | version |  os_name   |       os_version        
-------+---------+---------+------------+-------------------------
     2 | Firefox | 3.5.3   | Windows NT | 6.0.6002 Service Pack 2
     1 | Firefox | 3.5.3   | Windows NT | 5.1.2600 Service Pack 2
(2 rows)

the script found a total of three crashes for 2 os versions.  Going back to the original data for that time period and looking for exactly the same thing:

breakpad=> select 
breakpad->   product,
breakpad->   version,
breakpad->   os_name,
breakpad->   os_version,
breakpad->   date_processed
breakpad-> from
breakpad->   reports_20090921
breakpad-> where
breakpad->   signature = 'cooliris19.dll@0x351f2'
breakpad->   and product = 'Firefox'
breakpad->   and version = '3.5.3'
breakpad->   and '2009-09-27 14:48:00' < date_processed and date_processed < '2009-09-27 15:00:00';
 product | version |  os_name   |       os_version        |       date_processed       
---------+---------+------------+-------------------------+----------------------------
 Firefox | 3.5.3   | Windows NT | 5.1.2600 Service Pack 2 | 2009-09-27 14:49:50.560098
 Firefox | 3.5.3   | Windows NT | 6.0.6002 Service Pack 2 | 2009-09-27 14:53:59.705702
 Firefox | 3.5.3   | Windows NT | 6.0.6002 Service Pack 2 | 2009-09-27 14:52:15.497293
 Firefox | 3.5.3   | Windows NT | 5.1.2600 Service Pack 3 | 2009-09-27 14:56:23.766007
 Firefox | 3.5.3   | Windows NT | 6.0.6002 Service Pack 2 | 2009-09-27 14:58:46.563914
(5 rows)

it finds 5 matching entries for that same time period.
I have found two things that together make the totals come out the same:

1: There are a significant number of crashes for which os_name and os_version are NULL. This causes a difference in behavior

2: The period of time covered by the two queries is different. This causes a significant difference (more or less a factor of 2) between the two views of the data.

I propose to fix item 1 by:
  - ALTER TABLE osdims ALTER COLUMN os_name DROP NOT NULL;
  - patch (attached) for cachedIdAccess.py to allow nulls

I propose to fix item 2 in one of two possible ways: EITHER
  - change the range of dates to match
OR
  - put up a note on one or both views mentioning the difference in time span

Note that it is not entirely trivial to make the two views match, since one deals with date_processed and the other deals with window_end, so a simplistic fix would make things very _nearly_ the same, but not quite exactly so.
Created attachment 405380 [details] [diff] [review]
Handles null os_name and os_version for osdims table which is altered to allow them

This is one part of a two-part fix. The other part of the fix is to actually alter osdims to allow nulls in the os_name column.
Attachment #405380 - Flags: review?(lars)
(Reporter)

Comment 9

9 years ago
Griswolf has rerun the TCBS script after his changes.  I have rerun my queries from comment #6 and find that the data now matches.  I'm going to suggest that we discard the existing 'top_crashes_by_signature' table in production and regenerate it by rerunning the script.  Should we run it directly on production or try the trick of copying from dm-breakpad-stagedb again?

This time I want us to directly monitor the ongoing performance.  After the script has run for a day, I want us to cross check using variants of my Comment #6 queries.  Specifically, I want us to aim at data that came in after the original run of the script.
(Reporter)

Updated

9 years ago
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
(Reporter)

Updated

8 years ago
Attachment #405380 - Flags: review?(lars)
(Assignee)

Updated

7 years ago
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.