Closed Bug 519807 Opened 16 years ago Closed 15 years ago

Socorro - TCBS doesn't match totals in comparison with the advanced search

Categories

(Socorro :: General, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Unassigned)

Details

Attachments

(1 file)

As of 2009/09/30 the totals for the signature "cooliris19.dll@0x351f2" between the live search and the Top Crash by Signature trend reports do not match. Why is that?
Using this as a place to keep notes. The queries are entirely different: TREND: /* soc.web topcrash.topcrasherss */ SELECT p.product AS product, p.version AS version, tcs.signature, sum(tcs.count) as total, sum(case when o.os_name = 'Windows NT' then tcs.count else 0 end) as win, sum(case when o.os_name = 'Mac OS X' then tcs.count else 0 end) as mac, sum(case when o.os_name = 'Linux' then tcs.count else 0 end) as linux FROM top_crashes_by_signature tcs JOIN productdims p ON tcs.productdims_id = p.id JOIN osdims o ON tcs.osdims_id = o.id WHERE p.product='Firefox' AND p.version='3.5.3' AND window_end >= '2009-09-15 16:36:01' GROUP BY p.product, p.version, tcs.signature HAVING sum(tcs.count) > 0 ORDER BY total desc LIMIT 100 QUERY: /* soc.web common.queryTopSig. */ SELECT reports.signature, count(reports.id), count(CASE WHEN (reports.os_name = 'Windows NT') THEN 1 END) AS is_windows, count(CASE WHEN (reports.os_name = 'Mac OS X') THEN 1 END) AS is_mac, count(CASE WHEN (reports.os_name = 'Linux') THEN 1 END) AS is_linux, count(CASE WHEN (reports.os_name = 'Solaris') THEN 1 END) AS is_solaris FROM reports WHERE reports.signature IS NOT NULL AND (reports.product = 'Firefox') AND ((reports.product = 'Firefox' AND reports.version = '3.5.3')) AND reports.date_processed BETWEEN TIMESTAMP '2009-10-01 00:07:19'-CAST('1 weeks' AS INTERVAL) AND TIMESTAMP '2009-10-01 00:07:19' GROUP BY reports.signature ORDER BY count(reports.id) DESC LIMIT 100
Wouldn't the CAST() in the queryTopSig section hurt constraint exclusion for partitioning? Or is that only a but with 8.2?
That CAST is just an artifact of the language: Need to subtract a week, and that is one way to say it. The result is still a TIMESTAMP. I'm thinking about why this query is running against reports, rather than top_crashes_by_signature though...
Even though the queries are different, they should still come up with the same results when asked for the same data. This query shows over 14K crashes for the most recent two weeks for the specific signature 'cooliris19.dll@0x351f2' for FF 3.5.3: http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A3.5.3&date=&range_value=2&range_unit=weeks&query_search=signature&query_type=exact&query=cooliris19.dll%400x351f2&do_query=1 Over on TCBS for the same time period and same product/version, the signature 'cooliris19.dll@0x351f2' shows as #3 with a total of only 3K. The TCBS script originally gets its information from the reports table and aggregates it. The UI then does further aggregation by combining the script's aggregated buckets for the requested time period. Either the script or the UI is under counting.
here's some more evidence that the TCBS script is under counting. First, just looking at one 12 minute time slice from the period ending at 3pm on 2009-09-27 for the signature 'cooliris19.dll@0x351f2' and FF3.5.3 breakpad=> select breakpad-> p.count, breakpad-> p.product, breakpad-> p.version, breakpad-> o.os_name, breakpad-> o.os_version breakpad-> from breakpad-> top_crashes_by_signature t join osdims o on t.osdims_id = o.id breakpad-> join productdims p on t.productdims_id = p.id breakpad-> where breakpad-> signature = 'cooliris19.dll@0x351f2' breakpad-> and productdims_id = 46 breakpad-> and window_end = '2009-09-27 15:00:00'; count | product | version | os_name | os_version -------+---------+---------+------------+------------------------- 2 | Firefox | 3.5.3 | Windows NT | 6.0.6002 Service Pack 2 1 | Firefox | 3.5.3 | Windows NT | 5.1.2600 Service Pack 2 (2 rows) the script found a total of three crashes for 2 os versions. Going back to the original data for that time period and looking for exactly the same thing: breakpad=> select breakpad-> product, breakpad-> version, breakpad-> os_name, breakpad-> os_version, breakpad-> date_processed breakpad-> from breakpad-> reports_20090921 breakpad-> where breakpad-> signature = 'cooliris19.dll@0x351f2' breakpad-> and product = 'Firefox' breakpad-> and version = '3.5.3' breakpad-> and '2009-09-27 14:48:00' < date_processed and date_processed < '2009-09-27 15:00:00'; product | version | os_name | os_version | date_processed ---------+---------+------------+-------------------------+---------------------------- Firefox | 3.5.3 | Windows NT | 5.1.2600 Service Pack 2 | 2009-09-27 14:49:50.560098 Firefox | 3.5.3 | Windows NT | 6.0.6002 Service Pack 2 | 2009-09-27 14:53:59.705702 Firefox | 3.5.3 | Windows NT | 6.0.6002 Service Pack 2 | 2009-09-27 14:52:15.497293 Firefox | 3.5.3 | Windows NT | 5.1.2600 Service Pack 3 | 2009-09-27 14:56:23.766007 Firefox | 3.5.3 | Windows NT | 6.0.6002 Service Pack 2 | 2009-09-27 14:58:46.563914 (5 rows) it finds 5 matching entries for that same time period.
I have found two things that together make the totals come out the same: 1: There are a significant number of crashes for which os_name and os_version are NULL. This causes a difference in behavior 2: The period of time covered by the two queries is different. This causes a significant difference (more or less a factor of 2) between the two views of the data. I propose to fix item 1 by: - ALTER TABLE osdims ALTER COLUMN os_name DROP NOT NULL; - patch (attached) for cachedIdAccess.py to allow nulls I propose to fix item 2 in one of two possible ways: EITHER - change the range of dates to match OR - put up a note on one or both views mentioning the difference in time span Note that it is not entirely trivial to make the two views match, since one deals with date_processed and the other deals with window_end, so a simplistic fix would make things very _nearly_ the same, but not quite exactly so.
This is one part of a two-part fix. The other part of the fix is to actually alter osdims to allow nulls in the os_name column.
Attachment #405380 - Flags: review?(lars)
Griswolf has rerun the TCBS script after his changes. I have rerun my queries from comment #6 and find that the data now matches. I'm going to suggest that we discard the existing 'top_crashes_by_signature' table in production and regenerate it by rerunning the script. Should we run it directly on production or try the trick of copying from dm-breakpad-stagedb again? This time I want us to directly monitor the ongoing performance. After the script has run for a day, I want us to cross check using variants of my Comment #6 queries. Specifically, I want us to aim at data that came in after the original run of the script.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Attachment #405380 - Flags: review?(lars)
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: