Closed Bug 721169 Opened 13 years ago Closed 13 years ago

Inconsistent numbers across TCBS, search and report/list

Categories

(Socorro Graveyard :: Middleware, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: adrian, Assigned: adrian)

References

Details

We still have non matching numbers between TCBS and search / report/list. Those last two are running the same code base which mean they are consistent, but TCBS shows different numbers (often less crashes) for identical periods. This bug is to investigate and fix those inconsistencies, wherever they are.
Assignee: nobody → adrian
Copy / pasting an email from Josh: > SELECT count(*) FROM reports >> WHERE date_processed BETWEEN '2012-01-15T00:00:00+00:00' AND >> '2012-01-16T00:00:00+00:00'; >> >> # Returns 332371 results >> >> >> #.............................................................................. >> SELECT count(*) FROM reports_clean >> WHERE date_processed BETWEEN '2012-01-15T00:00:00+00:00' AND >> '2012-01-16T00:00:00+00:00'; >> >> # Returns 244635 results This difference is expected. Reports contains crashes which reports_clean does not, mainly: -- FF/TB/Fennec versions before 5.0 -- garbaged crashes where no product/version can be discerned Try comparing counts for a specific version of FF. >> #.............................................................................. >> SELECT sum(report_count) FROM tcbs >> WHERE report_date BETWEEN '2012-01-15T00:00:00+00:00' AND >> '2012-01-16T00:00:00+00:00'; >> >> # Returns 218710 results This is more interesting; I'm not sure why TCBS would have a lower count than reports_clean. When we did 2.4 I tested specific product/versions and results were correct, but I didn't compare grand totals. If you have time, it would be worth finding specific product/version/signature combos where the counts don't match between reports_clean and TCBS as test cases.
So, I tested with a specific product and version, and a smaller date range. > SELECT count(*) FROM reports > WHERE date_processed BETWEEN '2012-01-15T00:00:00+00:00' AND '2012-01-15T01:00:00+00:00' > AND product=E'Firefox' AND version=E'12.0a1'; > > # Returns 38 results > > #.............................................................................. > SELECT count(*) FROM reports_clean > JOIN product_versions pv USING (product_version_id) > WHERE date_processed BETWEEN '2012-01-15T00:00:00+00:00' AND '2012-01-15T01:00:00+00:00' > AND product_name=E'Firefox' AND version_string=E'12.0a1'; > > # Returns 34 results And then I tried to display the difference between those two queries. They have the exact same distinct signatures but different counts in some cases: > SELECT signature, count(*) FROM reports > WHERE date_processed BETWEEN '2012-01-15T00:00:00+00:00' AND '2012-01-15T01:00:00+00:00' > AND product=E'Firefox' AND version=E'12.0a1' > GROUP BY signature > EXCEPT > SELECT signature, count(*) FROM reports_clean > JOIN signatures USING (signature_id) > JOIN product_versions pv USING (product_version_id) > WHERE date_processed BETWEEN '2012-01-15T00:00:00+00:00' AND '2012-01-15T01:00:00+00:00' > AND product_name=E'Firefox' AND version_string=E'12.0a1' > GROUP BY signature; > > # Returns 4 results signature | count -----------------------------------------------------------------------------------------------------------+------- kernelbase.dll@0xcacd | 2 JSFunction::trace(JSTracer*) | 2 mozalloc_abort(char const* const) | NS_DebugBreak_P | nsIFrame::GetOffsetToCrossDoc(nsIFrame const*, int) | 2 JSObject::getGeneric(JSContext*, __int64, JS::Value*) | 2 (4 rows) Here are the four concerned UUIDs: bf504428-6e50-42ff-8298-3e3942120115 56f092a9-6c87-4e39-ad20-d20f82120115 520b5c9a-6589-453a-9189-dc4232120115 ef545c0a-0d46-4e6e-9384-af5ad2120115 Trying to hit them on prod gives me a "We couldn't find the OOID you're after." error. Does this mean anything? Josh, do you have any idea of what's going on?
@Josh: You asked me to find out why there was a difference in the total number of crashes in reports_clean and in tcbs. Here is what I found: SELECT product_name, version_string FROM reports_clean JOIN product_versions pv USING (product_version_id) WHERE date_processed BETWEEN '2012-01-15T00:00:00+00:00' AND '2012-01-16T00:00:00+00:00' GROUP BY product_name, version_string EXCEPT SELECT product_name, version_string FROM tcbs JOIN product_versions pv USING (product_version_id) WHERE report_date BETWEEN '2012-01-15T00:00:00+00:00' AND '2012-01-16T00:00:00+00:00' GROUP BY product_name, version_string ORDER BY product_name, version_string; # Returns 79 results product_name | version_string --------------+---------------- Fennec | 10.0a1 Fennec | 5.0 Fennec | 5.0b2 Fennec | 5.0b7 Fennec | 6.0 Fennec | 6.0.2 Fennec | 6.0b1 Fennec | 6.0b2 Fennec | 6.0b3 Fennec | 6.0b4 Fennec | 6.0b5 Fennec | 6.0b6 Fennec | 7.0 Fennec | 7.0b1 Fennec | 7.0b4 Fennec | 7.0b5 Fennec | 8.0b1 Fennec | 8.0b2 Fennec | 8.0b3 Fennec | 8.0b6 Fennec | 9.0b1 Firefox | 10.0a1 Firefox | 10.0a2 Firefox | 11.0a1 Firefox | 5.0 Firefox | 5.0.1 Firefox | 5.0b1 Firefox | 5.0b2 Firefox | 5.0b3 Firefox | 5.0b4 Firefox | 5.0b5 Firefox | 5.0b6 Firefox | 5.0b7 Firefox | 5.0(beta) Firefox | 6.0 Firefox | 6.0.1 Firefox | 6.0.2 Firefox | 6.0b1 Firefox | 6.0b2 Firefox | 6.0b3 Firefox | 6.0b4 Firefox | 6.0b5 Firefox | 6.0(beta) Firefox | 7.0 Firefox | 7.0.1(beta) Firefox | 7.0b1 Firefox | 7.0b2 Firefox | 7.0b3 Firefox | 7.0b4 Firefox | 7.0b5 Firefox | 7.0b6 Firefox | 7.0(beta) Firefox | 8.0a2 Firefox | 8.0b1 Firefox | 8.0b2 Firefox | 8.0b3 Firefox | 8.0b4 Firefox | 8.0b5 Firefox | 8.0b6 Firefox | 8.0(beta) Firefox | 9.0a1 Firefox | 9.0a2 Firefox | 9.0b1 SeaMonkey | 2.3 SeaMonkey | 2.3.1 SeaMonkey | 2.3.3 SeaMonkey | 2.3b1 SeaMonkey | 2.3b2 SeaMonkey | 2.3b3 SeaMonkey | 2.4b2 SeaMonkey | 2.5b3 SeaMonkey | 2.5b4 Thunderbird | 10.0a1 Thunderbird | 6.0 Thunderbird | 6.0.1 Thunderbird | 6.0.2 Thunderbird | 7.0b3 Thunderbird | 8.0b1 Thunderbird | 8.0b5 (79 rows) I suppose those product/version pairs are not concerned by TCBS?
Adrian, Regarding the 4 UUIDs you can't hit in prod: see if you can get lars or rhelmer to help you as to why this might be. I suspect that whatever prevents you from viewing them is also prevening me from adding them to reports_clean. As for the query regarding the differences between reports_clean and TCBS, what you've demonstrated is that differences in counts are not at all related to product or version. So something else is making up the difference. Thanks for doing this!
Hey Josh, I may have missed something, but it seems to me that I demonstrated that TCBS is missing some products / versions from reports_clean. Maybe those queries will make it clearer: SELECT count(*) FROM (SELECT product_name, version_string FROM reports_clean JOIN product_versions pv USING (product_version_id) WHERE date_processed BETWEEN '2012-01-15T00:00:00+00:00' AND '2012-01-16T00:00:00+00:00' GROUP BY product_name, version_string) products_versions; # Returns 139 SELECT count(*) FROM (SELECT product_name, version_string FROM tcbs JOIN product_versions pv USING (product_version_id) WHERE report_date BETWEEN '2012-01-15T00:00:00+00:00' AND '2012-01-16T00:00:00+00:00' GROUP BY product_name, version_string) products_versions; # Returns 60 There is a difference of 79 pairs of product:version in TCBS compared to reports_clean. Or am I totally wrong and missing something?
rhelmer: could you please help me finding what is wrong with those four reports? Here are the UUIDs: bf504428-6e50-42ff-8298-3e3942120115 56f092a9-6c87-4e39-ad20-d20f82120115 520b5c9a-6589-453a-9189-dc4232120115 ef545c0a-0d46-4e6e-9384-af5ad2120115 They cannot be accessed through the UI (Crash Not Found), and Josh suppose what causes this also causes the bug we're investigating. I have tried to find something different from those and regular reports but couldn't find anything. I also don't know why those reports are not included into reports_clean.
(In reply to Adrian Gaudebert [:adrian] from comment #6) > rhelmer: could you please help me finding what is wrong with those four > reports? Sure, tracking them down now.
(In reply to Adrian Gaudebert [:adrian] from comment #6) > rhelmer: could you please help me finding what is wrong with those four > reports? Here are the UUIDs: > > bf504428-6e50-42ff-8298-3e3942120115 > 56f092a9-6c87-4e39-ad20-d20f82120115 > 520b5c9a-6589-453a-9189-dc4232120115 > ef545c0a-0d46-4e6e-9384-af5ad2120115 > > They cannot be accessed through the UI (Crash Not Found), and Josh suppose > what causes this also causes the bug we're investigating. Which environment are you looking at? I don't see a record of these UUIDs in production/staging hbase (raw json) or postgres (reports table).
> There is a difference of 79 pairs of product:version in TCBS compared to > reports_clean. Or am I totally wrong and missing something? Oh, right! Sorry. reports_clean contains crashes from *all* versions. TCBS contains crashes only from "visible" versions, i.e. those between release_date and sunset_date. So, for example, a crash related to FF 7.0 would show up in reports_clean but not in TCBS. So the real test here is to focus on specific product/versions which are currently "visible", not other total counts.
I think I found one part of the problem: those 4 crash reports where processed on 2012-01-17 when they where received on 2012-01-15. So, when update_reports_clean ran, those four crashes where _not_ processed yet, and thus where not added to reports_clean. However, if it true in this particular case, it is not in others. I have another list of UUIDs of crash reports missing in reports_clean: 59efac4e-3ee6-41f3-8dd9-744a92120110 b183de8f-b497-4ac7-a697-3a28f2120110 05913b2e-04d4-4ceb-a0f1-7acb02120110 781d92d0-95c9-4cb4-ae1e-9d5e02120110 428206ad-8b27-4a68-a44f-dae2c2120110 5f0577cc-e97c-4573-b88a-115b92120110 0c6e7a7d-3f9f-4c20-8e2e-f7bb92120110 e30ba1b1-a083-45f4-9345-60af02120110 42401c57-c34d-411b-95d2-121762120110 fbd8418b-3ef9-40a1-911a-83e542120110 a0e6acce-344e-4340-a18a-ab6b42120110 fdb0ee10-639c-4de0-a308-199c22120110 b80ed9f0-bc70-4297-a215-f43952120110 5539a4b5-8b42-4138-948e-7064b2120110 c75961c4-841a-46f3-98a5-184152120110 55051391-645f-466a-be28-190002120110 80e63828-3835-4d8d-9ae2-3e7702120110 135a47ff-321f-44b6-a1bd-adb002120110 f3790ef6-6bcb-485a-9ff9-720cf2120110 I really don't get it, those crashes seem to be alright.
(In reply to Robert Helmer [:rhelmer] from comment #8) > Which environment are you looking at? I don't see a record of these UUIDs in > production/staging hbase (raw json) or postgres (reports table). I'm running my tests in my VM, with the dump from Jan16. (In reply to [:jberkus] Josh Berkus from comment #9) > So the real test here is to focus on specific product/versions which are > currently "visible", not other total counts. Ok, I'll create a script to test all versions and see if we have any inconsistency, but I doubt I'll find any difference.
(In reply to Adrian Gaudebert [:adrian] from comment #10) > I think I found one part of the problem: those 4 crash reports where > processed on 2012-01-17 when they where received on 2012-01-15. So, when > update_reports_clean ran, those four crashes where _not_ processed yet, and > thus where not added to reports_clean. > > However, if it true in this particular case, it is not in others. I have > another list of UUIDs of crash reports missing in reports_clean: > > 59efac4e-3ee6-41f3-8dd9-744a92120110 > b183de8f-b497-4ac7-a697-3a28f2120110 > 05913b2e-04d4-4ceb-a0f1-7acb02120110 > 781d92d0-95c9-4cb4-ae1e-9d5e02120110 > 428206ad-8b27-4a68-a44f-dae2c2120110 > 5f0577cc-e97c-4573-b88a-115b92120110 > 0c6e7a7d-3f9f-4c20-8e2e-f7bb92120110 > e30ba1b1-a083-45f4-9345-60af02120110 > 42401c57-c34d-411b-95d2-121762120110 > fbd8418b-3ef9-40a1-911a-83e542120110 > a0e6acce-344e-4340-a18a-ab6b42120110 > fdb0ee10-639c-4de0-a308-199c22120110 > b80ed9f0-bc70-4297-a215-f43952120110 > 5539a4b5-8b42-4138-948e-7064b2120110 > c75961c4-841a-46f3-98a5-184152120110 > 55051391-645f-466a-be28-190002120110 > 80e63828-3835-4d8d-9ae2-3e7702120110 > 135a47ff-321f-44b6-a1bd-adb002120110 > f3790ef6-6bcb-485a-9ff9-720cf2120110 > > I really don't get it, those crashes seem to be alright. Those crashes all have a Release Channel of "unknown" (at least, the 5 I checked all did) which prevents us from normalizing them or using then in reports. When we checked this earlier, such crashes constituted only 0.2% of all crashes sumbitted, and were a result of crashes which happened too soon after startup to collect useful data. However, the crashes you linked appear to have something different wrong, causing release channel not to be recorded. I'll test later to see if we've had a jump in release_channel=unknown, or if it's affiliated with a particular signature. Keep up the good work!
Daniel, Xavier: can you possibly help us analyze the reports that have no release_channel reported to work out what is common/special about them?
actually, let's see if Saptarshi might be up for the task. He might be able to use RIPHE to figure it out quickly.
Adding bug 704383 to this bug. Reports with unknown channel were originally fixed, and are apparently broken again. Please continue discussion of this particular data issue on that bug.
Depends on: 704383
Daniel, Not necessary; we figured out the cause of that issue and it's been addressed.
Depends on: 725396
It seems the biggest part of the problem cannot be solved whatsoever. Moving to 2.4.3 to run further verifications.
Target Milestone: 2.4.2 → 2.4.3
Target Milestone: 2.4.3 → 2.4.4
Target Milestone: 2.4.4 → 2.5
Target Milestone: 2.5 → 2.5.1
Target Milestone: 2.5.1 → 2.5.3
Target Milestone: 3 → 4
I swear I'll get that done someday! :)
Target Milestone: 4 → 5
Target Milestone: 5 → 6
Target Milestone: 6 → 7
Target Milestone: 7 → 9
Target Milestone: 9 → 10
Target Milestone: 10 → Future
This has been around for too long, and we never heard any complaint since we fixed the first issue. I assume this is fixed.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: Future → 10
Product: Socorro → Socorro Graveyard
You need to log in before you can comment on or make changes to this bug.