Closed Bug 658072 Opened 13 years ago Closed 12 years ago

Searching for 3 weeks of plugin crash data often returns a 500 Internal Server Error/times out

Categories

(Socorro :: Webapp, task, P2)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Assigned: rhelmer)


Details

Adding to 1.7.8 just for triage; what do you think, Laura?
Target Milestone: --- → 1.7.8
Assignee: nobody → rhelmer
This looks like a long-running query to me; looking in the pg_stats table I see:

SELECT COUNT(DISTINCT reports.signature) ... (JOIN plugins ON plugin_reports.plugin_id)

and so forth. I can tell from the comment (which is quite handy actually) that it's coming from:

http://code.google.com/p/socorro/source/browse/trunk/webapp-php/application/models/common.php#110

This does not look like a fast query, and it should be running against top_crashes_by_signature or some other summary table rather than reports. Even so, I think the primary reason it's timing out is that crash-stats-dev is a very underpowered all-in-one install of Socorro.
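
As a very rough illustration of the summary-table idea (the column names and three-week window below are hypothetical, not the actual Socorro schema), the point is to sum pre-aggregated per-signature counts rather than scan raw reports:

-- Hypothetical sketch only: column names (signature, count, window_end) are
-- illustrative and may not match the real top_crashes_by_signature schema.
SELECT signature, SUM(count) AS total
FROM top_crashes_by_signature
WHERE window_end BETWEEN '2011-05-01' AND '2011-05-22'  -- a 3-week window
GROUP BY signature
ORDER BY total DESC
LIMIT 100;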

I don't think this is something we can easily fix in 1.7.8, but we can address it as we move queries into the middleware for 2.0.
Priority: -- → P1
Target Milestone: 1.7.8 → 2.0
Target Milestone: 2.0 → 2.1
Priority: P1 → P2
Target Milestone: 2.1 → 2.3
Target Milestone: 2.3 → 2.4
For real-time search, I think using an index like ES is the way to go, rather than trying to search the reports table.

As of 2.2, querying the reports table is a lot more expensive than it used to be, since we need to do a pretty hairy set of JOINs to determine what the beta number is.

If we want to continue to support searching via postgres without ES, then we should consider using the reports_clean table that Josh is working on (we're starting to use it for reports).

The only downside I can see is that reports_clean is not real-time, but an indexable search engine like ES is probably better suited to this anyway. We *could* support updating reports_clean at the same time ES is updated (by the processors), but I'm not sure that level of complexity is worth it, especially since, if it's not the configuration we use, it won't get the same level of testing.
Hmm, OK: lonnen noticed us timing out (going over the 30 sec max exec time in PHP) on a search, so I looked into how it's being done now that it's in the middleware; it's pretty much the same.

Josh, the first query here is just being used to get a total count for pagination; could this be done more quickly (maybe as part of the second query)?

SELECT count(DISTINCT r.signature)
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product = E'Firefox'

SELECT r.signature,
       count(r.id) AS total,
       count(CASE WHEN (r.os_name = E'Windows NT') THEN 1 END) AS is_windows,
       count(CASE WHEN (r.os_name = E'Mac OS X') THEN 1 END) AS is_mac,
       count(CASE WHEN (r.os_name = E'Linux') THEN 1 END) AS is_linux,
       count(CASE WHEN (r.os_name = E'Solaris') THEN 1 END) AS is_solaris,
       SUM(CASE WHEN r.hangid IS NULL THEN 0 ELSE 1 END) AS numhang,
       SUM(CASE WHEN r.process_type IS NULL THEN 0 ELSE 1 END) AS numplugin
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product = E'Firefox'
GROUP BY r.signature
ORDER BY total DESC
LIMIT 100
OFFSET 0
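
A minimal sketch of one way to fold the count into the second query, assuming a PostgreSQL version with window functions (8.4 or later): after GROUP BY r.signature, count(*) OVER () counts the groups, i.e. the distinct signatures, so the separate COUNT(DISTINCT ...) pass could be dropped. It still scans the same rows, it just avoids doing so twice:

SELECT r.signature,
       count(r.id) AS total,
       count(*) OVER () AS total_signatures  -- hypothetical alias; stands in for the first query
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product = E'Firefox'
GROUP BY r.signature
ORDER BY total DESC
LIMIT 100
OFFSET 0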
Status: NEW → ASSIGNED
To clarify, it's only the first query in comment 4 that's very slow (the one that gets the total count for pagination purposes); the second one takes just a few seconds, which is totally acceptable for this type of search.
This is what's known as the "pagination count issue" and is common to all databases (including PG, MySQL, Oracle and Lucene IME) ... counting the total number of results often takes much, much longer than displaying one page of results.  There are a number of different workarounds for this problem.  However, they don't seem worth solving in Postgres if these searches are going to be running on ES in a few weeks.

If there's a good reason to solve this issue on PostgreSQL, then we can revisit it.
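
One of the usual workarounds, sketched here purely as an illustration, is to show an approximate total rather than an exact one by reading the planner's row estimate; EXPLAIN with FORMAT JSON needs PostgreSQL 9.0+, and the caller would parse the estimated row count out of the plan instead of running the COUNT:

-- Illustrative sketch: the top plan node's estimated rows approximates the
-- number of distinct signatures without executing the query.
EXPLAIN (FORMAT JSON)
SELECT DISTINCT r.signature
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product = E'Firefox';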
(In reply to [:jberkus] Josh Berkus from comment #6)
> This is what's known as the "pagination count issue" and is common to all
> databases (including PG, MySQL, Oracle and Lucene IME) ... counting the
> total number of results often takes much, much longer than displaying one
> page of results.  There are a number of different workarounds for this
> problem.  However, they don't seem worth solving in Postgres if these
> searches are going to be running on ES in a few weeks.
> 
> If there's a good reason to solve this issue on PostgreSQL, then we can
> revisit it.

Yeah, we discussed this a bit in IRC, and none of the workarounds are worth it given that we're moving search to ES soon.

We could bump up the PHP exec time (currently at 30s) so people could actually get the result, although it may take a few minutes.
Also, to clarify what's going on with the "500 ISE": that happens if the load balancer times out the request. We bumped this up a while ago, so now we only hit PHP's max exec time (which looks to the browser like the connection unexpectedly closed).
Component: Socorro → General
Product: Webtools → Socorro
Assignee: rhelmer → nobody
Component: General → Webapp
QA Contact: socorro → webapp
Assignee: nobody → rhelmer
Bumped max_execution_time from 30 to 120 on crash-stats-dev.

This env only has one week of history loaded right now (per https://wiki.mozilla.org/Socorro:Releases#PostgreSQL_Database_Snapshot_Tracking) so it may not be possible to fully test this until we get to staging at least.
Looks like the zeus timeout for crash-stats-dev is still really low; we'll need to make sure that one (as well as stage and prod) is set to the same level.
Target Milestone: 2.4 → 2.4.1
Target Milestone: 2.4.1 → 2.4.2
Depends on: 724061
OK, this requires a PHP config change; it's done on crash-stats-dev already and documented at https://wiki.mozilla.org/Socorro:Releases/2.4.2
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [testonstage]
Turns out the timeout for the connection between the webapp and mware (zeus) was not bumped in bug 724061.

This is fixed now.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Whiteboard: [testonstage]