Closed Bug 658072 Opened 13 years ago Closed 12 years ago

Searching for 3 weeks of plugin crash data often returns a 500 Internal Server Error/times out

Categories

(Socorro :: Webapp, task, P2)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: stephend, Assigned: rhelmer)


Details

Adding to 1.7.8 just for triage; what do you think, Laura?
Target Milestone: --- → 1.7.8
Assignee: nobody → rhelmer
This looks like a long-running query to me; looking in the pg_stats table I see:

SELECT COUNT(DISTINCT reports.signature) ... (JOIN plugins ON plugin_reports.plugin_id)

and so forth. I can tell from the comment (which is quite handy actually) that it's coming from:

http://code.google.com/p/socorro/source/browse/trunk/webapp-php/application/models/common.php#110

This does not look like a fast query, and it should be running against top_crashes_by_signature or some other summary table rather than reports. Even so, I think the primary reason it's timing out is that crash-stats-dev is a very underpowered all-in-one install of Socorro.
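
As a very rough illustration of the summary-table idea (the column names and three-week window below are hypothetical, not the actual Socorro schema), the point is to sum pre-aggregated per-signature counts rather than scan raw reports:

-- Hypothetical sketch only: column names (signature, count, window_end) are
-- illustrative and may not match the real top_crashes_by_signature schema.
SELECT signature, SUM(count) AS total
FROM top_crashes_by_signature
WHERE window_end BETWEEN '2011-05-01' AND '2011-05-22'  -- a 3-week window
GROUP BY signature
ORDER BY total DESC
LIMIT 100;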

I don't think this is something we can easily fix in 1.7.8, but we can address it as we move queries into the middleware for 2.0.
Priority: -- → P1
Target Milestone: 1.7.8 → 2.0
Target Milestone: 2.0 → 2.1
Priority: P1 → P2
Target Milestone: 2.1 → 2.3
Target Milestone: 2.3 → 2.4
For real-time search, I think using an index like ES is the way to go, rather than trying to search the reports table.

As of 2.2, querying the reports table is a lot more expensive than it used to be, since we need to do a pretty hairy set of JOINs to determine what the beta number is.

If we want to continue to support searching via postgres without ES, then we should consider using the reports_clean table that Josh is working on (we're starting to use it for reports).

The only downside I can see is that reports_clean is not real-time, but an indexable search engine like ES is probably better suited to this anyway. We *could* support updating reports_clean at the same time ES is updated (by the processors), but I'm not sure that level of complexity is worth it, especially since, if it's not the configuration we use, it won't get the same level of testing.
Hmm, OK: lonnen noticed us timing out (going over the 30 sec max exec time in PHP) on a search, so I looked into how it's being done now that it's in the middleware; it's pretty much the same.

Josh, the first query here is just being used to get a total count for pagination; could this be done more quickly (maybe as part of the second query)?

SELECT count(DISTINCT r.signature)
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product = E'Firefox'

SELECT r.signature,
       count(r.id) AS total,
       count(CASE WHEN (r.os_name = E'Windows NT') THEN 1 END) AS is_windows,
       count(CASE WHEN (r.os_name = E'Mac OS X') THEN 1 END) AS is_mac,
       count(CASE WHEN (r.os_name = E'Linux') THEN 1 END) AS is_linux,
       count(CASE WHEN (r.os_name = E'Solaris') THEN 1 END) AS is_solaris,
       SUM(CASE WHEN r.hangid IS NULL THEN 0 ELSE 1 END) AS numhang,
       SUM(CASE WHEN r.process_type IS NULL THEN 0 ELSE 1 END) AS numplugin
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product = E'Firefox'
GROUP BY r.signature
ORDER BY total DESC
LIMIT 100
OFFSET 0
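
A minimal sketch of one way to fold the count into the second query, assuming a PostgreSQL version with window functions (8.4 or later): after GROUP BY r.signature, count(*) OVER () counts the groups, i.e. the distinct signatures, so the separate COUNT(DISTINCT ...) pass could be dropped. It still scans the same rows, it just avoids doing so twice:

SELECT r.signature,
       count(r.id) AS total,
       count(*) OVER () AS total_signatures  -- hypothetical alias; stands in for the first query
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product = E'Firefox'
GROUP BY r.signature
ORDER BY total DESC
LIMIT 100
OFFSET 0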
Status: NEW → ASSIGNED
To clarify, it's only the first query in comment 4 that's very slow (the one that gets the total count for pagination purposes); the second one takes just a few seconds, which is totally acceptable for this type of search.
This is what's known as the "pagination count issue" and is common to all databases (including PG, MySQL, Oracle and Lucene IME) ... counting the total number of results often takes much, much longer than displaying one page of results.  There are a number of different workarounds for this problem.  However, they don't seem worth solving in Postgres if these searches are going to be running on ES in a few weeks.

If there's a good reason to solve this issue on PostgreSQL, then we can revisit it.
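
One of the usual workarounds, sketched here purely as an illustration, is to show an approximate total rather than an exact one by reading the planner's row estimate; EXPLAIN with FORMAT JSON needs PostgreSQL 9.0+, and the caller would parse the estimated row count out of the plan instead of running the COUNT:

-- Illustrative sketch: the top plan node's estimated rows approximates the
-- number of distinct signatures without executing the query.
EXPLAIN (FORMAT JSON)
SELECT DISTINCT r.signature
FROM reports r
WHERE r.date_processed BETWEEN '2011-11-02T17:04:49' AND '2011-11-09T17:04:49'
  AND r.product = E'Firefox';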
(In reply to [:jberkus] Josh Berkus from comment #6)
> This is what's known as the "pagination count issue" and is common to all
> databases (including PG, MySQL, Oracle and Lucene IME) ... counting the
> total number of results often takes much, much longer than displaying one
> page of results.  There are a number of different workarounds for this
> problem.  However, they don't seem worth solving in Postgres if these
> searches are going to be running on ES in a few weeks.
> 
> If there's a good reason to solve this issue on PostgreSQL, then we can
> revisit it.

Yeah, we discussed this a bit in IRC, and none of the workarounds are worth it given that we're moving search to ES soon.

We could bump up the PHP exec time (currently at 30s) so people could actually get the result, although it may take a few minutes.
Also, to clarify what's going on with the "500 ISE": that happens if the load balancer times out the request. We bumped this up a while ago, so now we only hit PHP's max exec time (which looks to the browser like the connection unexpectedly closed).
Component: Socorro → General
Product: Webtools → Socorro
Assignee: rhelmer → nobody
Component: General → Webapp
QA Contact: socorro → webapp
Assignee: nobody → rhelmer
Bumped max_execution_time from 30 to 120 on crash-stats-dev.

This env only has one week of history loaded right now (per https://wiki.mozilla.org/Socorro:Releases#PostgreSQL_Database_Snapshot_Tracking) so it may not be possible to fully test this until we get to staging at least.
Looks like the zeus timeout for crash-stats-dev is still really low; we'll need to make sure that one (as well as stage and prod) is set to the same level.
Target Milestone: 2.4 → 2.4.1
Target Milestone: 2.4.1 → 2.4.2
Depends on: 724061
OK, this requires a PHP config change; it's done on crash-stats-dev already and documented at https://wiki.mozilla.org/Socorro:Releases/2.4.2
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: [testonstage]
Turns out the timeout for the connection between the webapp and mware (zeus) was not bumped in bug 724061.

This is fixed now.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Whiteboard: [testonstage]