Bug 444749 - Complex or long searches are timing out
Status: RESOLVED FIXED
Product: Socorro
Classification: Server Software
Component: General
Version: other
Platform: All All
Importance: P1 blocker with 3 votes
Target Milestone: 0.6
Assigned To: Michael Morgan [:morgamic]
QA Contact: socorro
Mentors:
URL: http://crash-stats.mozilla.com/
Duplicates: 422908 444993 446378 465621
Depends on: 432450 465360 465657 466103
Blocks: 432397 454640 468727
Reported: 2008-07-11 08:10 PDT by Steven Michaud [:smichaud] (Retired)
Modified: 2013-12-27 14:31 PST


Description Steven Michaud [:smichaud] (Retired) 2008-07-11 08:10:40 PDT
For the last day or so, every search I've performed at
http://crash-stats.mozilla.com/ has returned "Error 500. Internal
Server Error".

This includes even the simplest searches -- like for topcrashers.
Comment 1 Aravind Gottipati [:aravind] 2008-07-11 10:56:59 PDT
The database is pretty heavily loaded and we have a few ideas on reducing the load.  However, it will probably be next week by the time we put them into production.
Comment 2 Steven Michaud [:smichaud] (Retired) 2008-07-11 11:55:22 PDT
The socorro servers have been badly overloaded for months.  But it's
only in the last couple of days that (as far as I can tell) they've
become completely non-functional.

Is something down?

Has the server setup been changed recently?
Comment 3 Aravind Gottipati [:aravind] 2008-07-11 16:58:18 PDT
I wouldn't say they are completely non-functional.  It's aggregate reporting that's broken currently.  Viewing individual reports works just fine.

We have been revamping the back-end recently and some of the database tables have gotten out of control - that's the reason for the recent problems with some of the reporting.
Comment 4 Steven Michaud [:smichaud] (Retired) 2008-07-11 17:52:17 PDT
(In reply to comment #3)

OK, I stand corrected.  I just redid a search by bug ID that worked a
couple of days ago ... and it still works.

Probably the largest volume of searches is by bug ID (from
about:crashes), so it's good they're working.  But it'd still be nice
to get the other searches working again, even if only sub-optimally.

Good luck!  I'm glad I'm not the one who has to fix this :-)
Comment 5 Tony Mechelynck [:tonymec] 2008-07-13 07:07:06 PDT
*** Bug 444993 has been marked as a duplicate of this bug. ***
Comment 6 Aravind Gottipati [:aravind] 2008-07-18 14:40:45 PDT
Waiting on code fixes from morgamic.
Comment 7 Samuel Sidler (old account; do not CC) 2008-07-21 06:16:08 PDT
*** Bug 446378 has been marked as a duplicate of this bug. ***
Comment 8 Mark Banner (:standard8) 2008-08-14 01:28:52 PDT
(In reply to comment #6)
> Waiting on code fixes from morgamic.
> 
This comment was almost a month ago, any updates/other bugs to watch?
Comment 9 Aravind Gottipati [:aravind] 2008-08-14 11:39:24 PDT
We have code updates, aggregate tables and such.  Waiting on me to push this stuff out.
Comment 10 Marc Bejarano 2008-08-22 14:45:11 PDT
hi, aravind.  any ETA on your push?

thanks!
Comment 11 Henrik Skupin (:whimboo) 2008-08-24 12:46:46 PDT
This bug is very odd and has now existed for a month and a half. There are a lot of crash reports and no way to search the database. If we really want to decrease the number of crashers on trunk we should fix it ASAP. No crash analysis is possible at the moment. Nearly everything returns error 500, even the search for top crashers.

Any chance to push all the stuff out within the next few days?
Comment 12 Aravind Gottipati [:aravind] 2008-08-26 08:40:16 PDT
We are working on it and the code is currently in our staging environment.  Pushing this to production will involve some downtime and stuff.  We will probably be pushing this out later this week or next week if it doesn't make it this week.

I understand the frustration that there are no aggregate reports available, but there are other changes to the app (like crash report throttling, etc) that take priority.
Comment 13 Michael Morgan [:morgamic] 2008-09-05 17:05:31 PDT
The Pylons version was replaced with a PHP version written by Les Orchard (webdev).  Most of the aggregate queries work properly now.  There are some known bugs with the new version, and we're working on those as well -- but please file a bug if you see something.  Resolving for now.

Known:
1) missing bonsai links
2) topcrashers pages returning strange results (issue with topcrasher cron)
3) larger queries still take a long time (need to move query page over to topcrashers summary table)
Comment 14 chris hofmann 2008-09-05 19:26:59 PDT
cool....

looks like source line info is in the raw dump tab for those looking for a work around for problem #1 above
Comment 15 Bob Clary [:bc:] 2008-09-06 13:28:28 PDT
morgamic: I'm still seeing 500 errors whenever a report selects a sufficient amount of data. Should I file a new bug? This one is in mozilla.org:server operations, but should I file the new one in webtools:socorro?
Comment 16 Tony Chung [:tchung] 2008-09-08 15:26:31 PDT
I also am unable to view any crash reports.  From the topcrashers dashboard, clicking on the link to any crash report will just spin and time out.  Should this bug be reopened?
Comment 17 Tony Chung [:tchung] 2008-09-08 15:42:21 PDT
Here's a more specific example

1) go to http://crash-stats.mozilla.com/?do_query=1&product=Firefox&version=Firefox%3A3.0.2pre&query_search=signature&query_type=contains&query=&date=&range_value=2&range_unit=weeks
2) click on a report: (eg. MultiByteToWideChar)  
3) Site times out, and page shows: Connection Interrupted

"The connection to the server was reset while the page was loading.

The network link was interrupted while negotiating a connection. Please try again."

4) open error console: 

Error: parsers is undefined
Source File: http://crash-stats.mozilla.com/js/jquery/plugins/ui/jquery.tablesorter.min.js
Line: 2
Comment 18 Michael Morgan [:morgamic] 2008-09-08 15:51:40 PDT
The report list is of concern - that shouldn't be timing out.  Reopening this to address that issue.  Will file another bug for the topcrasher byversion and bybranch pages separately.
Comment 19 Tony Mechelynck [:tonymec] 2008-09-08 20:56:34 PDT
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b1pre) Gecko/20080908024232 SeaMonkey/2.0a1pre

With the query link in comment #17 (just clicking the link above) I get:

/!\  Connection Interrupted

     The document contains no data.

     The network link was interrupted while negotiating a connection. Please try again.

     [Try again]

Clicking "Try again" brings the same error page up again (after the timeout). It isn't a modem disconnection since I can post to Bugzilla. Clearing the error console then trying again leaves the error console empty.

OTOH I could reach crash-stats.mozilla.com tonight for an _individual_ crash report, bp-ae5e6623-7df7-11dd-abe9-001a4bd43e5c .

Clicking the link "Find a report" at top then querying for SeaMonkey crashes filed within one day (there's at least mine) http://crash-stats.mozilla.com/?do_query=1&product=SeaMonkey&query_search=signature&query_type=contains&query=&date=&range_value=1&range_unit=days brings up 5 signatures including one (23 reports) on Windows, three (1 report each) on Mac, and mine on Linux.

However, "Top Crashers" for SeaMonkey 2.0a1pre brings up 3 signatures, all with the same build ID, and all with zero crashes in all 4 columns (Total, Win, Lin, Mac). This looks suspect to me but I suppose it would be a different bug.

After all that, I see a number of irrelevant warnings and errors in the Error Console, and 4 warnings (no errors) pertaining to crash-stats.mozilla.com:
/!\ Warning: Error in parsing value for 'filter'.  Declaration dropped.
    Source File: http://crash-stats.mozilla.com/css/flora/flora.slider.css
    Line: 4
/!\ Warning: Error in parsing value for 'filter'.  Declaration dropped.
    Source File: http://crash-stats.mozilla.com/css/flora/flora.tabs.css
    Line: 82
/!\ Warning: Expected ',' or '{' but found 'html'.  Ruleset ignored due to bad selector.
    Source File: http://crash-stats.mozilla.com/css/flora/flora.datepicker.css
    Line: 37
/!\ Warning: Error in parsing value for 'filter'.  Declaration dropped.
    Source File: http://crash-stats.mozilla.com/css/flora/flora.datepicker.css
    Line: 174
Comment 20 Michael Morgan [:morgamic] 2008-09-10 09:27:19 PDT
*** Bug 422908 has been marked as a duplicate of this bug. ***
Comment 21 Marc Bejarano 2008-09-10 10:52:30 PDT
morgamic: what is the bug number for the top crasher reports issues?
Comment 22 Michael Morgan [:morgamic] 2008-09-10 11:01:59 PDT
Getting that filed shortly, I'll update this bug.
Comment 23 Tony Chung [:tchung] 2008-09-10 12:23:33 PDT
(In reply to comment #22)
> Getting that filed shortly, I'll update this bug.


Bug 454640 filed
Comment 24 Henrik Skupin (:whimboo) 2008-09-10 12:47:13 PDT
These are not only searches for top crasher reports but also for normal stack frames. I never get a result when starting a search for a special frame.
Comment 25 Mats Palmgren (vacation) 2008-09-24 16:43:13 PDT
Searching for a stack frame is still not working.
Comment 26 Michael Morgan [:morgamic] 2008-11-14 09:40:18 PST
PHP + Memcache + ProxyCache + Cluster should mean we're in good shape.  Reopen with URLs if you are still experiencing timeouts.
Comment 28 Michael Morgan [:morgamic] 2008-11-14 10:52:26 PST
I'd narrow it down to a shorter time frame or just firefox release versions.  Link worked for me.
Comment 29 Michael Morgan [:morgamic] 2008-11-14 10:53:11 PST
Don't know what to tell you -- I'm going to make fixing search response times for queries like this a high priority.
Comment 30 Jeffrey Baker 2008-11-14 10:54:13 PST
It probably worked for you because I already heated up the caches, right?
Comment 31 Michael Morgan [:morgamic] 2008-11-14 11:03:44 PST
Yea, probably.  We'll keep this open but it's going to be low priority because queries for an entire branch over an entire month for a string match are fairly uncommon.  We're looking at q1 or q2 next year unless it blocks something big.
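
For anyone following the cache discussion above, here is a minimal sketch of a read-through cache with expiry. It shows one way the same query can time out cold and then "work" once someone has warmed the cache. This is not Socorro's actual code: `TTLCache`, `slow_query`, the cache key, and the 300-second TTL are all made up for illustration, and the in-process dict only stands in for memcache.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process read-through cache (hypothetical stand-in for memcache)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]      # warm: the database is never touched
        value = compute()      # cold: pay the full query cost (this is where a timeout would hit)
        self._store[key] = (now, value)
        return value


calls = []


def slow_query() -> str:
    # Placeholder for an expensive aggregate query against the reports table.
    calls.append(1)
    return "report list"


cache = TTLCache(ttl_seconds=300)
first = cache.get_or_compute("firefox:signature:4weeks", slow_query)   # cold
second = cache.get_or_compute("firefox:signature:4weeks", slow_query)  # warm
```

Only the first call runs `slow_query`; the second is served from the cache until the entry expires, which would match the "worked for me" experience after somebody else already ran the same search.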
Comment 32 Wayne Mery (:wsmwk, NI for questions) 2008-11-14 12:16:25 PST
as morphed, this prob no longer blocks bug 454640, 454872
Comment 33 Mats Palmgren (vacation) 2008-11-17 06:41:29 PST
It does block me from working on bug 454872.
It still times out when searching for stack frames.
Comment 34 Michael Morgan [:morgamic] 2008-11-17 09:46:21 PST
Mats - we'll spend time on bug 465360 which should fix the stack frame search perf issues.
Comment 35 Michael Morgan [:morgamic] 2008-11-18 19:30:40 PST
Also note that the partitioning bug 432450 will increase performance and prevent costly scans.
Comment 36 Michael Morgan [:morgamic] 2008-11-18 19:31:39 PST
*** Bug 465621 has been marked as a duplicate of this bug. ***
Comment 37 Samuel Sidler (old account; do not CC) 2009-01-19 16:26:51 PST
morgamic: It seems that now even non-complex queries are timing out. There's a couple other bugs on file about this now... bug 468405 and bug 474037. I think these are all the same, but you'd know better... We really need Socorro queries working. :-/
Comment 38 Michael Morgan [:morgamic] 2009-01-19 18:47:25 PST
I hear you.  First thing I'd like to do is speed up deploying the partitioning work.  That is going to happen in the next week or so, and if we can get it done sooner rather than later it'd alleviate most of these issues.

We have other options -- we will have to discuss these first, though.  We will have an update by Wednesday morning so we can get un-stuck.
Comment 39 Aaron Leventhal 2009-01-20 01:21:37 PST
Can we consider both a short and long term solution?
Short term ugly solution is to just copy the whole setup over to a mirror so we can query history (January 20 and earlier). Basically I don't usually need the very latest up-to-the-minute crashes, and I'm willing to go to a different server for those queries.

Long term we can circle back to fixing the whole system so we don't need that bifurcation. But, it seems to me that enabling crash-fixing work is more important than making everyone wait for the elegant solution.

Of course, I probably am spewing rubbish since I don't know how this is set up. Just saying an ugly solution would work fine for most urgent work right now.
Comment 40 Wayne Mery (:wsmwk, NI for questions) 2009-01-28 07:51:46 PST
rather bad again today
Comment 41 Steven Michaud [:smichaud] (Retired) 2009-01-28 08:08:41 PST
For some time (weeks? months?) I've only been able to get results in the last hour.
Comment 42 Aaron Leventhal 2009-01-28 08:14:33 PST
This is blocking bug 468727, which itself is blocking the release. It's a very common crash for any HP laptop user with a fingerprint reader. We can't get help from the vendor, so we must find out what the regression range is.

I haven't yet heard why the entire database can't just be copied to a mirror that doesn't take in new crashes. We need to be able to do some crash archeology.
Comment 43 Samuel Sidler (old account; do not CC) 2009-01-28 08:16:25 PST
This should be fixed as part of database partitioning that will start tomorrow night and go (potentially) through Friday. Try on Monday next week and see if you still have issues.

See also: http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/
Comment 44 Ted Mielczarek [:ted.mielczarek] 2009-01-28 08:21:22 PST
(In reply to comment #42)
> I haven't yet heard why the entire database can't just be copied to a mirror
> that doesn't take in new crashes. We need to be able to do some crash
> archeology.

The database is huge, and the people that would be needed to make something like that happen are actively working on the partitioning fix mentioned above, which should be the real fix here.
Comment 45 Ted Mielczarek [:ted.mielczarek] 2009-01-28 08:22:35 PST
Also, it's not clear to me that making a readonly copy of the db would make queries any better. The problem is simply that the database grows too large, and postgres can't cope. The new partitioning scheme should ensure that no partition gets too large, such that the query planner can execute efficient queries no matter how many total crash reports we accumulate.
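
To illustrate the idea, here is a rough sketch of partition pruning by date range. It is not the actual Socorro schema or how the Postgres planner is implemented; the weekly boundaries, the start date, and the names `weekly_partitions` and `partitions_to_scan` are assumptions made up for this example.

```python
from datetime import date, timedelta
from typing import List, Tuple

Partition = Tuple[date, date]  # half-open date range [start, end)


def weekly_partitions(first_day: date, weeks: int) -> List[Partition]:
    """Build consecutive one-week partitions starting at first_day."""
    return [
        (first_day + timedelta(weeks=i), first_day + timedelta(weeks=i + 1))
        for i in range(weeks)
    ]


def partitions_to_scan(parts: List[Partition],
                       query_start: date,
                       query_end: date) -> List[Partition]:
    """Keep only partitions that overlap the query window [query_start, query_end)."""
    return [(s, e) for (s, e) in parts if s < query_end and query_start < e]


# Eight hypothetical weekly partitions, and a query covering roughly one week.
parts = weekly_partitions(date(2009, 1, 5), weeks=8)
hit = partitions_to_scan(parts, date(2009, 1, 20), date(2009, 1, 28))
```

Here `hit` contains two of the eight partitions, so the work for a date-bounded query depends on the width of the window rather than on the total number of reports stored.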
Comment 46 Wayne Mery (:wsmwk, NI for questions) 2009-01-28 09:00:18 PST
all good news.

question - given "we’re going to chunk only the most recent four weeks of data and leave the rest as a single oversize partition." (which I think implies 5 partitions), is it believed that a reasonably scoped query will be able to search as much as 60 days, or even the 120 days of history mentioned in http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/

I was able to get a range for Bug 468727 using "occurring before".  The search required >60 days range from today.
Comment 47 Steven Michaud [:smichaud] (Retired) 2009-02-02 08:57:34 PST
Something happened towards the end of last week (on Thursday and
Friday), and things are now marginally better.  But they're still
pretty bad.

Over the weekend (when the Socorro servers are presumably less busy),
I got a fairly complex search (from bug 476292 comment #0) to work
over a range of 96 hours.  But now I seem to max out at 8 hours:

http://crash-stats.mozilla.com/?do_query=1&product=Firefox&query_search=stack&query_type=exact&query=nsCocoaWindow%3A%3AShow%28int%29&date=&range_value=8&range_unit=hours

By the way, neither of these searches works (or worked) as-is.  In
both cases I searched over 1 hour, then 2 hours, then 4 hours (and so
forth).  This is cumbersome, but better than nothing.

There's still room for improvement, though :-)
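
In case it saves anyone some clicking, the widening-range trick can be sketched like this. The query parameters are copied from the URL above; the function name `widening_search_urls` and the `max_hours` cutoff are made up for the example, and whether each step really warms a cache on the server is only my guess.

```python
from typing import Iterator
from urllib.parse import urlencode

BASE = "http://crash-stats.mozilla.com/"


def widening_search_urls(query: str, max_hours: int) -> Iterator[str]:
    """Yield search URLs for 1, 2, 4, ... hours, up to max_hours."""
    hours = 1
    while hours <= max_hours:
        params = {
            "do_query": 1,
            "product": "Firefox",
            "query_search": "stack",
            "query_type": "exact",
            "query": query,
            "date": "",
            "range_value": hours,
            "range_unit": "hours",
        }
        yield BASE + "?" + urlencode(params)
        hours *= 2


urls = list(widening_search_urls("nsCocoaWindow::Show(int)", max_hours=8))
```

Loading the URLs in order, and moving to the next one only after the previous one returns, is the manual procedure described above.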
Comment 48 Wayne Mery (:wsmwk, NI for questions) 2009-02-02 09:29:09 PST
(In reply to comment #47)
> Something happened towards the end of last week (on Thursday and
> Friday), and things are now marginally better.  

partitioning happened, Thurs-Sat iirc, or at least one part of it.
http://blog.mozilla.com/webdev/ -- bug 432450


> Over the weekend (when the Socorro servers are presumably less busy),
> I got a fairly complex search (from bug 476292 comment #0) to work
> over a range of 96 hours.  But now I seem to max out at 8 hours:
> 
> http://crash-stats.mozilla.com/?do_query=1&product=Firefox&query_search=stack&query_type=exact&query=nsCocoaWindow%3A%3AShow%28int%29&date=&range_value=8&range_unit=hours

8 hours worked straight off for me. impressive. 12hr too. but 18hr fails. still that's a pretty hefty improvement.

however, a simple top of stack, 3 week query fails http://crash-stats.mozilla.com/?do_query=1&product=Thunderbird&query_search=signature&query_type=contains&query=nsMsgLocalMailFolder%3A%3AAddMessage&date=&range_value=3&range_unit=weeks
Comment 49 K Lars Lohn [:lars] [:klohn] 2009-02-02 09:44:16 PST
The irony is that partitioning didn't actually happen.  The same trouble that's causing these queries to take so long and/or fail also thwarted the queries that repartitioning required.  The database was restored from backup on Friday night as it became apparent that repartitioning could not happen in a reasonable amount of time.

We will reattempt partitioning after we resolve some of the database maintenance difficulties that balked us.
Comment 50 Steven Michaud [:smichaud] (Retired) 2009-02-02 10:06:07 PST
(In reply to comment #48)

> 8 hours worked straight off for me.

This is probably because my search was still cached.

> 12hr too. but 18hr fails. still that's a pretty hefty improvement.

This I can't explain, though.  I tried 12 hours a couple of times
before I gave up.  Maybe it's just random variations in the degree to
which the servers are busy.

(In reply to comment #49)

Oh, well ... a bit of the placebo effect, I suppose :-)

Thinking that repartitioning was done (at least partly done) made me
try harder to get searches to work.
Comment 51 Michael Morgan [:morgamic] 2009-02-02 10:33:46 PST
Update here regarding partitioning:
http://blog.mozilla.com/webdev/2009/02/02/socorro-partitioning-rolled-back/

Next attempt will be Thursday.  After that push, new data will be in separate partitions which allows us the ability to repartition older data offline and restore it to the operational database.  That should help us avoid extended downtimes in the future.
Comment 52 Mike Beltzner [:beltzner, not reading bugmail] 2009-02-03 10:26:49 PST
(Carrying over duplicate nomination from bug 422908)
Comment 53 Michael Morgan [:morgamic] 2009-09-24 08:48:15 PDT
Not aware of any current timeouts.  Marking as FIXED.
