Closed Bug 679065 Opened 14 years ago Closed 14 years ago

Some Fennec 6.0b6 topcrash signatures return an empty report list

Categories

(Socorro :: General, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kairo, Assigned: rhelmer)

References

()

Details

If this is a code error, I'd ship a 2.2.1 for it, seems pretty important.
Severity: normal → major
Target Milestone: --- → 2.3
This looks like an error in new tcbs aggregation. I don't see any reports for Fennec 6.0b6 with this signature for the selected time range: breakpad=> select build_id from product_versions as pv join product_version_builds as pvb on (pv.product_version_id = pvb.product_version_id) and product_name = 'Fennec' and version_string = '6.0b6'; build_id ---------------- 20110810164700 20110810181057 20110810183719 20110810192454 20110810184133 (5 rows) breakpad=> SELECT product, version, build FROM reports WHERE product = 'Fennec' AND version = '6.0' AND signature = 'mozilla::layout::RenderFrameParent::GetLayerManager' AND reports.date_processed BETWEEN CAST('2011-08-14' AS TIMESTAMP WITHOUT TIME ZONE) - CAST('7 days' AS INTERVAL) AND CAST('2011-08-14' AS TIMESTAMP WITHOUT TIME ZONE) ORDER BY reports.date_processed DESC; product | version | build ---------+---------+---------------- Fennec | 6.0 | 20110804021247 Fennec | 6.0 | 20110804021247 (2 rows) -------------------- Yet it appears in TCBS: -------------------- breakpad=> select product_name, version_string, signature, report_date from tcbs join signatures as s on s.signature_id = tcbs.signature_id join product_versions as pv on tcbs.product_version_id = pv.product_version_id where signature = 'mozilla::layout::RenderFrameParent::GetLayerManager' and product_name = 'Fennec' and version_string = '6.0b6'; product_name | version_string | signature | report_date --------------+----------------+-----------------------------------------------------+------------- Fennec | 6.0b6 | mozilla::layout::RenderFrameParent::GetLayerManager | 2011-08-14 (1 row)
Assignee: nobody → josh
Umm, but the CSV has one: robert@robert:/mnt/mozilla/projects/socorro> grep "mozilla::layout::RenderFrameParent::GetLayerManager" 2011-08-14/20110814-pub-crashdata.csv mozilla::layout::RenderFrameParent::GetLayerManager URL (removed) http://crash-stats.mozilla.com/report/index/eea72d6a-1866-40c1-b572-64d572110814 201108140834 201108140833 \N Fennec 6.0 20110810164700 6 Linux 2.6.32.9 ?arch?arm | None | 0 0xe0 603680 \N 4 \N ../../dist/include/nsIDocument.h checked [blank] \N SIGSEGV \N motorola DROID2 | verizon/droid2_vzw/cdma_droid2/droid2:2.2/VZW/23.20:user/ota-rel-keys,release-keys 104916 \N That's the 20110810164700 buildID from what I'm seeing...
https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Alayout%3A%3ARenderFrameParent%3A%3AGetLayerManager&version=Fennec%3A6.0b6 works (without a date/time range). Are we running into a problem because TCBS is using UTC and /reports/list is assuming Pacific?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #3) > Umm, but the CSV has one: > > robert@robert:/mnt/mozilla/projects/socorro> grep > "mozilla::layout::RenderFrameParent::GetLayerManager" > 2011-08-14/20110814-pub-crashdata.csv > mozilla::layout::RenderFrameParent::GetLayerManager URL (removed) > http://crash-stats.mozilla.com/report/index/eea72d6a-1866-40c1-b572- > 64d572110814 201108140834 201108140833 \N Fennec 6.0 > 20110810164700 6 Linux 2.6.32.9 ?arch?arm | None | 0 0xe0 > 603680 \N 4 \N ../../dist/include/nsIDocument.h > checked [blank] \N SIGSEGV \N motorola DROID2 | > verizon/droid2_vzw/cdma_droid2/droid2:2.2/VZW/23.20:user/ota-rel-keys, > release-keys 104916 \N > > That's the 20110810164700 buildID from what I'm seeing... Hmm good point... so I notice the cutoff for that reports query is '2011-08-14' (the right-hand-side of the "BETWEEN" clause), so that's not going to show up. So the problem here isn't that TCBS aggregation is wrong, it's that the reports query only shows up to today Pacific. Note that the way the date decided is not new code, this is starting to look like a pre-existing bug (possibly exacerbated by new tcbs being UTC? not sure).
(In reply to Robert Helmer [:rhelmer] from comment #5) > So the problem here isn't that TCBS aggregation is wrong, it's that the > reports query only shows up to today Pacific. It shows everything before *yesterday* (midnight) Pacific (2011-08-14 is yesterday, not today, and the crash happened yesterday morning). ;-)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #6) > (In reply to Robert Helmer [:rhelmer] from comment #5) > > So the problem here isn't that TCBS aggregation is wrong, it's that the > > reports query only shows up to today Pacific. > > It shows everything before *yesterday* (midnight) Pacific (2011-08-14 is > yesterday, not today, and the crash happened yesterday morning). ;-) Right, thanks :)
The time zone issues are expected. The only real cure for them is a gradual movement away from pacific timestamps and towards UTC time and/or timestamptz. There's so much code to be changed though, that we need to do this in stages.
(In reply to Josh Berkus from comment #8) > The time zone issues are expected. 1) This is not just timezone, this is almost a whole day that it's wrong. 2) In this case, it should not be hard to fix if TCBS knows the exact time its report ends and pass it in the link (possibly with some TZ conversion for now).
Flags: in-testsuite?
Flags: in-litmus?
This might or might not get fixed by 2.2.4. We'll see.
Just to clarify, this is probably UI or mware. In the example in here, we have list?range_value=7&range_unit=days&date=2011-08-14& and that seems to show "7 days up to the timestamp 2011-08-14" when the report it is linked from shows "7 days up to including 2011-08-14" which equals to "7 days up to timestamp 2011-08-15 (00:00:00)". This should be easier to trigger with our one-day topcrasher reports, e.g. https://crash-stats.mozilla.com/topcrasher/byversion/Fennec/7.0b5/1/browser
After hashing this out with KaiRo, this is apparently an middleware issue where the date limits used in the different screens aren't phrased the same. As such, reassigning to bsavage and moving to 2.3.
Assignee: josh → bsavage
Target Milestone: 2.3 → 2.3.1
Target Milestone: 2.3.1 → 2.3.2
Which type of report do you want to have? Do you want to have it "up to and including <day>" or "up to but excluding <day>"?
(In reply to Brandon Savage [:brandon] from comment #13) > Which type of report do you want to have? Do you want to have it "up to and > including <day>" or "up to but excluding <day>"? Whatever matches the topcrash report that link is on! I don't care what way you resolve it, either change the link in the topcrash report to go to the next day or change the list to be inclusive of the queried day.
Target Milestone: 2.3.2 → 2.3.3
It appears that the reports table is reporting different data than the tcbs table. The report/list page uses the reports table to fetch it's data; should it be using something else? If not, then this appears to be both a code problem and a data bug, since tcbs should report data consistent with reports.
This is still a problem as of 2.3.2.
Assignee: bsavage → josh
Brandon, Please give me the query being passed by the middleware.
The following query is being used by the middleware in socorro/services/tcbs/modern.py: WITH tcbs_r as ( SELECT signature_id, signature, product_name, version_string, sum(report_count) as report_count, sum(win_count) as win_count, sum(lin_count) as lin_count, sum(mac_count) as mac_count, sum(hang_count) as hang_count, plugin_count(process_type,report_count) as plugin_count, content_count(process_type,report_count) as content_count FROM tcbs JOIN signatures USING (signature_id) JOIN product_versions USING (product_version_id) WHERE product_name = '%s' AND version_string = '%s' AND report_date BETWEEN '%s' AND '%s' %s GROUP BY signature_id, signature, product_name, version_string ), tcbs_window AS ( SELECT tcbs_r.*, sum(report_count) over () as total_crashes, dense_rank() over (order by report_count desc) as ranking FROM tcbs_r ) SELECT signature, report_count, win_count, lin_count, mac_count, hang_count, plugin_count, content_count, report_count / total_crashes::float as percent_of_total FROM tcbs_window ORDER BY report_count DESC LIMIT %s
Brandon: I'm going to need to capture the actual query passed when you click through. Can we set up an isolated test of this on devdb later today? Thanks.
Brandon and I debugged this. First of all, the query above is not the one which is being passed to search reports. That's a different query. The issue is actually a date issue, but not specifically one for the time zone. The problem is that when TCBS shows "1 day of 11-06" what it means is "11-06 00:00:00 UTC TO 11-07 00:00:00 UTC". But what gets passed in the reports search is: BETWEEN CAST('2011-11-06' AS TIMESTAMP WITHOUT TIME ZONE) - CAST('1 days' AS INTERVAL) AND CAST('2011-11-06' AS TIMESTAMP WITHOUT TIME ZONE) In other words, '11-06 00:00:00 PST' TO '11-06 00:00:00 PST'. A period with only 7 hours of overlap with the TCBS day. I don't know if it's the TCBS screen which is displaying the wrong dates or the reports screen which is searching on the wrong dates. But the two need to be consistent.
Entire queries per Brandon's request: WITH tcbs_r as ( SELECT signature_id, signature, product_name, version_string, sum(report_count) as report_count, sum(win_count) as win_count, sum(lin_count) as lin_count, sum(mac_count) as mac_count, sum(hang_count) as hang_count, plugin_count(process_type,report_count) as plugin_count, content_count(process_type,report_count) as content_count FROM tcbs JOIN signatures USING (signature_id) JOIN product_versions USING (product_version_id) WHERE product_name = 'Fennec' AND version_string = '8.0b3' AND report_date BETWEEN '2011-11-05' AND '2011-11-06' AND process_type = 'browser' GROUP BY signature_id, signature, product_name, version_string ), tcbs_window AS ( SELECT tcbs_r.*, sum(report_count) over () as total_crashes, dense_rank() over (order by report_count desc) as ranking FROM tcbs_r ) SELECT signature, report_count, win_count, lin_count, mac_count, hang_count, plugin_count, content_count, report_count / total_crashes::float as percent_of_total FROM tcbs_window ORDER BY report_count DESC LIMIT 300 ",,,,,,,,,"" SELECT reports.date_processed, reports.uptime, reports.user_comments, reports.uuid, reports.product, reports.version, reports.build, reports.signature, reports.url, reports.os_name, reports.os_version, reports.cpu_name, reports.cpu_info, reports.address, reports.reason, reports.last_crash, reports.install_age, reports.hangid, reports.process_type, ( reports.client_crash_date - ( reports.install_age * INTERVAL '1 second' ) ) as install_time, reports_duplicates.duplicate_of FROM reports JOIN product_versions ON reports.version = product_versions.release_version AND reports.product = product_versions.product_name LEFT OUTER JOIN reports_duplicates ON reports.uuid = reports_duplicates.uuid WHERE reports.signature = 'RNG_FileUpdate' AND ((reports.product = 'Fennec' AND product_versions.version_string = '8.0b3' AND reports.version = product_versions.release_version AND reports.release_channel ILIKE 'beta' AND product_versions.build_type = 'Beta' AND EXISTS ( SELECT 1 FROM product_version_builds WHERE product_versions.product_version_id = product_version_builds.product_version_id AND build_numeric(reports.build) = product_version_builds.build_id ))) AND reports.date_processed BETWEEN CAST('2011-11-06' AS TIMESTAMP WITHOUT TIME ZONE) - CAST('1 days' AS INTERVAL) AND CAST('2011-11-06' AS TIMESTAMP WITHOUT TIME ZONE) ORDER BY reports.date_processed DESC LIMIT 500 OFFSET 0;
so as you can see, the issue comes from the change from oldtcbs to newtcbs. In oldtcbs, the "window_end" column was a timestamp, so "2011-11-05 to 2011-11-06" meant "all of 11/05, stopping at midnight". Whereas in newtcbs, the "report_date" column is a DATE, which means that "2011-11-05 to 2011-11-06" means "all of 11/05 and 11/06".
Target Milestone: 2.3.3 → 2.3.4
Assignee: josh → chris.lonnen
I just randomly was able to reproduce this while working on a different bug (generating a minimal dev DB), so I am in a good position to test and fix (it's blocking me anyway).
Assignee: chris.lonnen → rhelmer
Status: NEW → ASSIGNED
lonnen r?: https://github.com/mozilla/socorro/pull/161 After consulting with Josh and testing locally, here is a fix to treat /report/list queries as Pacific (since we can assume with newtcbs that incoming links will be UTC). All of this code will go away and use the common search code when bug 678642 is fixed, so not going super deep into this. We should actually audit the search code to make sure it's doing this correctly (it will be an issue for Elastic Search queries as well, since those are Pacific while newtcbs is UTC). The only downside I can think of is that now pre-rapid-release (old TCBS) links will have the problem described in this bug, but as that's deprecated I'm not sure it's worth the larger change we'd need to make to this code to make it determine this and choose the right method.
If it's one or the other, fennec has our priority. Tested locally, r+.
Commit pushed to https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/217f8fa514045c8e7149540bb5fbc09e14832cce Merge pull request #161 from rhelmer/bug679065-fix-report-list-query bug 679065 - incoming links will be UTC dates for newtcbs, be explicit a...
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.