Closed
Bug 658072
Opened 13 years ago
Closed 12 years ago
Searching for 3 weeks of plugin crash data often returns a 500 Internal Server Error/times out
Categories
(Socorro :: Webapp, task, P2)
Socorro
Webapp
Tracking
(Not tracked)
VERIFIED
FIXED
2.4.2
People
(Reporter: stephend, Assigned: rhelmer)
Note: this slow-running query doesn't fail 100% of the time. Not sure if this is related to bug 652880, so filing anyway. BTW, this is fine on prod, but as I'd like to see an automated test for this, I don't want to have it failing.

STR:

Load https://crash-stats-dev.allizom.org/query/query?product=Firefox&version=ALL%3AALL&range_value=3&range_unit=weeks&date=05%2F18%2F2011+14%3A27%3A28&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=plugin&hang_type=any&plugin_field=filename&plugin_query_type=exact&plugin_query=&do_query=1

Expected Results:

Search should return results.

Actual Results:

[14:29:45.868] GET https://crash-stats-dev.allizom.org/query/query?product=Firefox&version=ALL%3AALL&range_value=3&range_unit=weeks&date=05%2F18%2F2011+14%3A27%3A28&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=plugin&hang_type=any&plugin_field=filename&plugin_query_type=exact&plugin_query=&do_query=1 [HTTP/1.1 500 Internal Server Error 31386ms]

Zeus displays a timeout. Rob, what do the logs say?
Flags: in-testsuite?
Flags: in-litmus?
Reporter
Comment 1•13 years ago
Adding to 1.7.8 just for triage; what do you think, Laura?
Target Milestone: --- → 1.7.8
Updated•13 years ago
Assignee: nobody → rhelmer
Assignee
Comment 2•13 years ago
This looks like a long-running query to me. Looking in the pg_stats table I see:

SELECT COUNT(DISTINCT reports.signature) ... (JOIN plugins ON plugin_reports.plugin_id) and so forth.

I can tell from the comment (which is quite handy, actually) that it's coming from:
http://code.google.com/p/socorro/source/browse/trunk/webapp-php/application/models/common.php#110

This does not look like it would be a very fast query, and it should be running against top_crashes_by_signature or some other table rather than reports. Even so, I think the primary reason it's timing out is that crash-stats-dev is a very underpowered all-in-one install of Socorro.

I don't think this is something we can easily fix in 1.7.8, but we can fix it as we move queries into the middleware for 2.0.
Assignee
Updated•13 years ago
Priority: -- → P1
Assignee
Updated•13 years ago
Target Milestone: 1.7.8 → 2.0
Assignee
Updated•13 years ago
Target Milestone: 2.0 → 2.1
Assignee
Updated•13 years ago
Priority: P1 → P2
Updated•13 years ago
Target Milestone: 2.1 → 2.3
Updated•13 years ago
Target Milestone: 2.3 → 2.4
Assignee
Comment 3•13 years ago
For realtime search, I think using an index like ES is the way to go rather than trying to search the reports table. As of 2.2, querying the reports table is a lot more expensive than it used to be, since we need to do a pretty hairy set of JOINs to determine what the beta number is.

If we want to continue to support searching via Postgres without ES, then we should consider using the reports_clean table that Josh is working on (we're starting to use it for reports). The only downside I can see is that reports_clean is not real-time, but an indexable search engine like ES is probably better suited to this anyway. We *could* update reports_clean at the same time ES is updated (by the processors), but I'm not sure this level of complexity is worth it, especially since a configuration we don't use ourselves is not going to get the same level of testing.
Assignee
Comment 4•13 years ago
Hmm, ok — lonnen noticed us timing out (going over the 30 sec max exec time in PHP) on a search, so I looked into how it's being done now that it's in the middleware; it's pretty much the same. Josh, the first query here is just being used to get a total count for pagination; could this be done more quickly (maybe as part of the second query)?

SELECT count(DISTINCT r.signature)
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product=E'Firefox'

SELECT r.signature,
       count(r.id) as total,
       count(CASE WHEN (r.os_name = E'Windows NT') THEN 1 END) AS is_windows,
       count(CASE WHEN (r.os_name = E'Mac OS X') THEN 1 END) AS is_mac,
       count(CASE WHEN (r.os_name = E'Linux') THEN 1 END) AS is_linux,
       count(CASE WHEN (r.os_name = E'Solaris') THEN 1 END) AS is_solaris,
       SUM(CASE WHEN r.hangid IS NULL THEN 0 ELSE 1 END) AS numhang,
       SUM(CASE WHEN r.process_type IS NULL THEN 0 ELSE 1 END) AS numplugin
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product=E'Firefox'
GROUP BY r.signature
ORDER BY total DESC
LIMIT 100 OFFSET 0
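[Editor's note] One generic way to fold a pagination count into the paged query itself — a sketch, not something proposed in this bug — is a window function: after GROUP BY r.signature, COUNT(*) OVER () yields the number of distinct signatures, which is exactly what the separate count query computes. Illustrated here against SQLite (window functions require SQLite 3.25+); Postgres accepts the same syntax. Table contents are made up for the demo.

```python
import sqlite3

# Demo: fetch one page of grouped results AND the total group count in a
# single query, via COUNT(*) OVER (). The window function runs after
# GROUP BY but before LIMIT, so every returned row carries the full count.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports (id INTEGER PRIMARY KEY, signature TEXT);
    INSERT INTO reports (signature) VALUES
      ('sigA'), ('sigA'), ('sigB'), ('sigC'), ('sigC'), ('sigC');
""")

rows = conn.execute("""
    SELECT signature,
           COUNT(*)         AS total,         -- crashes per signature
           COUNT(*) OVER () AS distinct_sigs  -- total groups, for pagination
    FROM reports
    GROUP BY signature
    ORDER BY total DESC
    LIMIT 2 OFFSET 0
""").fetchall()
# rows -> [('sigC', 3, 3), ('sigA', 2, 3)]: two page rows, total of 3 signatures
```

Whether this is actually cheaper than two queries depends on the plan — the grouping work still has to scan everything — but it does save a round trip.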
Status: NEW → ASSIGNED
Assignee
Comment 5•13 years ago
To clarify: it's only the first query in comment 4 that's very slow (the one that gets the total count for pagination purposes); the second one takes just a few seconds, which is totally acceptable for this type of search.
Comment 6•13 years ago
This is what's known as the "pagination count issue", and it is common to all databases (including PG, MySQL, Oracle, and Lucene, IME): counting the total number of results often takes much, much longer than displaying one page of results. There are a number of different workarounds for this problem. However, they don't seem worth applying in Postgres if these searches are going to be running on ES in a few weeks.

If there's a good reason to solve this issue on PostgreSQL, then we can revisit it.
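[Editor's note] The cheapest of those workarounds — a hypothetical sketch, not what Socorro did — is to drop the exact total altogether: fetch one row more than the page size and use the surplus row only to decide whether a "next" link is needed. A minimal SQLite illustration (table and names are invented for the demo, not Socorro's schema):

```python
import sqlite3

# Count-free pagination: over-fetch by one row; the extra row never
# reaches the caller, it only flags that another page exists.
PAGE_SIZE = 2

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports (id INTEGER PRIMARY KEY, signature TEXT);
    INSERT INTO reports (signature) VALUES ('a'), ('b'), ('c'), ('d'), ('e');
""")

def fetch_page(offset):
    rows = conn.execute(
        "SELECT signature FROM reports ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE + 1, offset),
    ).fetchall()
    has_next = len(rows) > PAGE_SIZE
    return [r[0] for r in rows[:PAGE_SIZE]], has_next

first_page, more = fetch_page(0)   # (['a', 'b'], True)
last_page, more2 = fetch_page(4)   # (['e'], False)
```

The trade-off is losing the "page X of Y" display; the query itself stays as fast as fetching one page.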
Assignee
Comment 7•13 years ago
(In reply to [:jberkus] Josh Berkus from comment #6)
> This is what's known as the "pagination count issue" and is common to all
> databases (including PG, MySQL, Oracle and Lucene IME) ... counting the
> total number of results often takes much, much longer than displaying one
> page of results. There are a number of different workarounds for this
> problem. However, they don't seem worth solving in Postgres if these
> searches are going to be running on ES in a few weeks.
>
> If there's a good reason to solve this issue on PostgreSQL, then we can
> revisit it.

Yeah, we discussed this a bit in IRC and none of the workarounds are worth it given that we're moving search to ES soon. We could bump up the PHP exec time (currently 30s) so people could actually get the result, although it may take a few minutes.
Assignee
Comment 8•13 years ago
Also, to clarify what's going on with the "500 ISE": that happens when the load balancer times out the request. We bumped that timeout up a while ago, so now we only hit PHP's max exec time (which looks to the browser like the connection unexpectedly closed).
Updated•13 years ago
Component: Socorro → General
Product: Webtools → Socorro
Assignee
Updated•13 years ago
Assignee: rhelmer → nobody
Component: General → Webapp
QA Contact: socorro → webapp
Assignee
Updated•13 years ago
Assignee: nobody → rhelmer
Assignee
Comment 9•13 years ago
Bumped max_execution_time from 30 to 120 on crash-stats-dev. This environment only has one week of history loaded right now (per https://wiki.mozilla.org/Socorro:Releases#PostgreSQL_Database_Snapshot_Tracking), so it may not be possible to fully test this until we get to staging at least.
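[Editor's note] The change described above amounts to a one-line php.ini edit (the 30→120 values are from this comment; the exact file location depends on the install):

```
; php.ini, or an included conf.d file, depending on the install
max_execution_time = 120   ; was 30; lets slow search queries finish
```

The same directive can also be set per-script with PHP's set_time_limit(), but a config change applies uniformly to the webapp.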
Assignee
Comment 10•13 years ago
Looks like the Zeus timeout for crash-stats-dev is still really low; we'll need to make sure that it (as well as stage and prod) is set to the same level.
Updated•13 years ago
Target Milestone: 2.4 → 2.4.1
Assignee
Updated•12 years ago
Target Milestone: 2.4.1 → 2.4.2
Assignee
Comment 11•12 years ago
OK, this requires a PHP config change; it's done on crash-stats-dev already and documented at https://wiki.mozilla.org/Socorro:Releases/2.4.2
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Reporter
Comment 12•12 years ago
Looks good on -dev; I'd like to test on stage before verifying (just to be sure, since it'll have moar data):

[13:53:42.508] GET https://crash-stats-dev.allizom.org/query/query?product=Firefox&version=ALL%3AALL&range_value=3&range_unit=weeks&date=05%2F18%2F2011+14%3A27%3A28&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=plugin&hang_type=any&plugin_field=filename&plugin_query_type=exact&plugin_query=&do_query=1 [HTTP/1.1 200 OK 231ms]
Updated•12 years ago
Whiteboard: [testonstage]
Comment 13•12 years ago
Reopening; it looks like I can still generate a 500 error with the following URL:

https://crash-stats.allizom.org/query/query?product=Firefox&version=ALL%3AALL&range_value=1&range_unit=weeks&date=02%2F10%2F2012+23%3A03%3A46&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=any&hang_type=any&do_query=1

It causes this error:

[15:03:54.156] GET https://crash-stats.allizom.org/query/query?product=Firefox&version=ALL%3AALL&range_value=1&range_unit=weeks&date=02%2F10%2F2012+23%3A03%3A46&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=any&hang_type=any&do_query=1 [HTTP/1.1 500 Internal Server Error 30770ms]

In Arecibo it looks like this might be similar to bug 721899, though it's hard to tell.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 14•12 years ago
Turns out the timeout for the connection between the webapp and the middleware (Zeus) was not bumped in bug 724061. This is fixed now.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 15•12 years ago
Bumping to QA-verified; thanks to all who dug into this bug. Verified on stage:

[16:18:41.052] GET https://crash-stats.allizom.org/query/query?product=Firefox&version=ALL%3AALL&range_value=1&range_unit=weeks&date=02%2F10%2F2012+23%3A03%3A46&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=any&hang_type=any&do_query=1 [HTTP/1.1 200 OK 60545ms]
Status: RESOLVED → VERIFIED
Updated•12 years ago
Whiteboard: [testonstage]