Closed Bug 444749 Opened 16 years ago Closed 15 years ago

Complex or long searches are timing out

Categories

(Socorro :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: smichaud, Assigned: morgamic)

Details

For the last day or so, every search I've performed at
http://crash-stats.mozilla.com/ has returned "Error 500. Internal
Server Error".

This includes even the simplest searches -- like for topcrashers.
OS: Linux → All
Hardware: PC → All
The database is pretty heavily loaded and we have a few ideas on reducing the load.  However, it will probably be next week by the time we put them into production.
Assignee: server-ops → aravind
The socorro servers have been badly overloaded for months.  But it's
only in the last couple of days that (as far as I can tell) they've
become completely non-functional.

Is something down?

Has the server setup been changed recently?
I wouldn't say they are completely non-functional. It's the aggregate reporting that's broken currently. Viewing individual reports works just fine.

We have been revamping the back-end recently and some of the database tables have gotten out of control - that's the reason for the recent problems with some of the reporting.
(In reply to comment #3)

OK, I stand corrected.  I just redid a search by bug ID that worked a
couple of days ago ... and it still works.

Probably the largest volume of searches is by bug ID (from
about:crashes), so it's good they're working.  But it'd still be nice
to get the other searches working again, even if only sub-optimally.

Good luck!  I'm glad I'm not the one who has to fix this :-)
Summary: Socorro servers returning Error 500 on every search → Socorro servers returning Error 500 on every aggregate search
No longer blocks: 445244
No longer blocks: 445246
Waiting on code fixes from morgamic.
Assignee: aravind → morgamic
QA Contact: justin → mrz
Assignee: morgamic → aravind
(In reply to comment #6)
> Waiting on code fixes from morgamic.
> 
This comment was almost a month ago, any updates/other bugs to watch?
We have code updates, aggregate tables and such.  Waiting on me to push this stuff out.
hi, aravind.  any ETA on your push?

thanks!
This bug is very odd and has now existed for one and a half months. There are a lot of crash reports and no way to search the database. If we really want to decrease the number of crashes on trunk we should fix it ASAP. There is no crash analysis possible at the moment. Nearly everything is returning error 500, even the search for top crashers.

Any chance to push all the stuff out within the next days?
We are working on it and the code is currently in our staging environment. Pushing this to production will involve some downtime and stuff. We will probably push this out later this week, or next week if it doesn't make it this week.

I understand the frustration that there are no aggregate reports available, but there are other changes to the app (like crash report throttling, etc) that take priority.
The Pylons version was replaced with a PHP version written by Les Orchard (webdev).  Most of the aggregate queries work properly now.  There are some known bugs with the new version, and we're working on those as well -- but please file a bug if you see something.  Resolving for now.

Known:
1) missing bonsai links
2) topcrashers pages returning strange results (issue with topcrasher cron)
3) larger queries still take a long time (need to move the query page over to a topcrashers summary table; see the sketch below)
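
For context on item 3: a summary table trades per-report rows for pre-aggregated counts, so the query page can read a handful of rollup rows instead of scanning millions of raw reports. A minimal sketch of the idea, using hypothetical names (this is not Socorro's actual schema):

    -- Hypothetical rollup: one row per (signature, product, version, day)
    -- instead of one row per crash report.
    CREATE TABLE topcrashers_summary (
        signature   text    NOT NULL,
        product     text    NOT NULL,
        version     text    NOT NULL,
        report_date date    NOT NULL,
        crash_count integer NOT NULL DEFAULT 0,
        PRIMARY KEY (signature, product, version, report_date)
    );

    -- The topcrasher cron mentioned above would periodically refresh it
    -- from the raw per-report table, e.g. for yesterday's reports:
    INSERT INTO topcrashers_summary
    SELECT signature, product, version,
           date_processed::date, count(*)
      FROM reports  -- the hypothetical raw crash-report table
     WHERE date_processed >= current_date - 1
       AND date_processed <  current_date
     GROUP BY signature, product, version, date_processed::date;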
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
cool....

looks like source line info is in the raw dump tab, for those looking for a workaround for problem #1 above
morgamic: I'm still seeing 500 errors whenever a report selects a sufficient amount of data. Should I file a new bug? This one is in mozilla.org:server operations, but should I file the new one in webtools:socorro?
I also am unable to view any crash reports. From the topcrashers dashboard, clicking on the link to any crash report will just spin and time out. Should this bug be reopened?
Here's a more specific example:

1) go to http://crash-stats.mozilla.com/?do_query=1&product=Firefox&version=Firefox%3A3.0.2pre&query_search=signature&query_type=contains&query=&date=&range_value=2&range_unit=weeks
2) click on a report (e.g. MultiByteToWideChar)
3) Site times out, and page shows: Connection Interrupted

"The connection to the server was reset while the page was loading.

The network link was interrupted while negotiating a connection. Please try again."

4) open error console: 

Error: parsers is undefined
Source File: http://crash-stats.mozilla.com/js/jquery/plugins/ui/jquery.tablesorter.min.js
Line: 2
Assignee: aravind → nobody
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
QA Contact: mrz → socorro
Summary: Socorro servers returning Error 500 on every aggregate search → Socorro queries are timing out
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The report list is of concern - that shouldn't be timing out.  Reopening this to address that issue.  Will file another bug for the topcrasher byversion and bybranch pages separately.
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b1pre) Gecko/20080908024232 SeaMonkey/2.0a1pre

With the query link in comment #17 (just clicking the link above) I get:

/!\  Connection Interrupted

     The document contains no data.

     The network link was interrupted while negotiating a connection. Please try again.

     [Try again]

Clicking "Try again" brings the same error page up again (after the timeout). It isn't a modem disconnection since I can post to Bugzilla. Clearing the error console then trying again leaves the error console empty.

OTOH I could reach crash-stats.mozilla.org tonight for an _individual_ crash report, bp-ae5e6623-7df7-11dd-abe9-001a4bd43e5c .

Clicking the "Find a report" link at top and then querying for SeaMonkey crashes filed within one day (there's at least mine) at http://crash-stats.mozilla.com/?do_query=1&product=SeaMonkey&query_search=signature&query_type=contains&query=&date=&range_value=1&range_unit=days brings up 5 signatures: one (23 reports) on Windows, three (1 report each) on Mac, and mine on Linux.

However, "Top Crashers" for SeaMonkey 2.0a1pre brings up 3 signatures, all with the same build ID, and all with zero crashes in all 4 columns (Total, Win, Lin, Mac). This looks suspect to me but I suppose it would be a different bug.

After all that, I see a number of irrelevant warnings and errors in the Error Console, and 4 warnings (no errors) pertaining to crash-stats.mozilla.org:
/!\ Warning: Error in parsing value for 'filter'.  Declaration dropped.
    Source File: http://crash-stats.mozilla.com/css/flora/flora.slider.css
    Line: 4
/!\ Warning: Error in parsing value for 'filter'.  Declaration dropped.
    Source File: http://crash-stats.mozilla.com/css/flora/flora.tabs.css
    Line: 82
/!\ Warning: Expected ',' or '{' but found 'html'.  Ruleset ignored due to bad selector.
    Source File: http://crash-stats.mozilla.com/css/flora/flora.datepicker.css
    Line: 37
/!\ Warning: Error in parsing value for 'filter'.  Declaration dropped.
    Source File: http://crash-stats.mozilla.com/css/flora/flora.datepicker.css
    Line: 174
Target Milestone: --- → 0.6
morgamic: what is the bug number for the top crasher reports issues?
Getting that filed shortly, I'll update this bug.
Blocks: 454640
(In reply to comment #22)
> Getting that filed shortly, I'll update this bug.


Bug 454640 filed
These are not only searches for top crasher reports but also for normal stack frames. I never get a result when searching for a specific stack frame.
Status: REOPENED → NEW
Summary: Socorro queries are timing out → Socorro queries are timing out (Connection Interrupted)
Searching for a stack frame is still not working.
Severity: major → blocker
Blocks: 454872
PHP + Memcache + ProxyCache + Cluster should mean we're in good shape.  Reopen with URLs if you are still experiencing timeouts.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
I'd narrow it down to a shorter time frame or just Firefox release versions. Link worked for me.
Don't know what to tell you -- I'm going to make fixing search response times for queries like this a high priority.
It probably worked for you because I had already warmed up the caches, right?
Yea, probably. We'll keep this open, but it's going to be low priority because queries for an entire branch over an entire month for a string match are fairly uncommon. We're looking at Q1 or Q2 next year unless it blocks something big.
Summary: Socorro queries are timing out (Connection Interrupted) → Complex or long searches are timing out
As morphed, this probably no longer blocks bug 454640 or bug 454872.
It does block me from working on bug 454872.
It still times out when searching for stack frames.
Depends on: 465360
Mats - we'll spend time on bug 465360 which should fix the stack frame search perf issues.
No longer blocks: 454872
Depends on: 432450
Assignee: nobody → morgamic
Status: REOPENED → ASSIGNED
Also note that the partitioning bug 432450 will increase performance and prevent costly scans.
Depends on: 465657
Priority: -- → P1
Depends on: 466103
Blocks: 468727
morgamic: It seems that now even non-complex queries are timing out. There's a couple other bugs on file about this now... bug 468405 and bug 474037. I think these are all the same, but you'd know better... We really need Socorro queries working. :-/
I hear you. First thing I'd like to do is speed up deploying the partitioning work. That is going to happen in the next week or so, and if we can get it done sooner rather than later it'd alleviate most of these issues.

We have other options -- we will have to discuss these first, though. We will have an update by Wednesday morning so we can get unstuck.
Can we consider both a short- and a long-term solution?
The short-term ugly solution is to just copy the whole setup over to a mirror so we can query history (January 20 and earlier). Basically I don't usually need the very latest up-to-the-minute crashes, and I'm willing to go to a different server for those queries.

Long term we can circle back to fixing the whole system so we don't need that bifurcation. But it seems to me that enabling crash-fixing work is more important than making everyone wait for the elegant solution.

Of course, I probably am spewing rubbish since I don't know how this is set up. Just saying an ugly solution would work fine for most urgent work right now.
Blocks: 432397
rather bad again today
For some time (weeks? months?) I've only been able to get results when restricting searches to the last hour.
This is blocking bug 468727, which itself is blocking the release. It's a very common crash for any HP laptop user with a fingerprint reader. We can't get help from the vendor, so we must find out what the regression range is.

I haven't yet heard why the entire database can't just be copied to a mirror that doesn't take in new crashes. We need to be able to do some crash archeology.
This should be fixed as part of database partitioning that will start tomorrow night and go (potentially) through Friday. Try on Monday next week and see if you still have issues.

See also: http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/
(In reply to comment #42)
> I haven't yet heard why the entire database can't just be copied to a mirror
> that doesn't take in new crashes. We need to be able to do some crash
> archeology.

The database is huge, and the people that would be needed to make something like that happen are actively working on the partitioning fix mentioned above, which should be the real fix here.
Also, it's not clear to me that making a readonly copy of the db would make queries any better. The problem is simply that the database grows too large, and postgres can't cope. The new partitioning scheme should ensure that no partition gets too large, such that the query planner can execute efficient queries no matter how many total crash reports we accumulate.
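
To make that concrete: in the PostgreSQL 8.x era, partitioning like this was typically implemented with table inheritance plus CHECK constraints, letting the planner prove a child table cannot match a date-bounded query and skip it entirely. A minimal sketch of the technique, using illustrative table and column names rather than Socorro's actual schema:

    -- Parent table; individual crash reports live in date-ranged children.
    CREATE TABLE reports (
        id             serial,
        signature      text,
        product        text,
        version        text,
        date_processed timestamp NOT NULL
    );

    -- One child per week; the CHECK constraint is what lets the planner
    -- exclude partitions outside the queried date range.
    CREATE TABLE reports_20090119 (
        CHECK (date_processed >= '2009-01-19' AND date_processed < '2009-01-26')
    ) INHERITS (reports);
    CREATE INDEX reports_20090119_date ON reports_20090119 (date_processed);

    -- With constraint exclusion enabled, a date-bounded query scans only
    -- the matching children, no matter how much total history exists.
    SET constraint_exclusion = on;
    SELECT signature, count(*)
      FROM reports
     WHERE date_processed >= '2009-01-19'
       AND date_processed <  '2009-01-21'
     GROUP BY signature;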
All good news.

Question: given "we’re going to chunk only the most recent four weeks of data and leave the rest as a single oversize partition." (which I think implies 5 partitions), is it believed that a reasonably scoped query will be able to search as much as 60 days, or even the 120 days of history mentioned in http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/ ?

I was able to get a range for Bug 468727 using "occurring before". The search required a range of more than 60 days back from today.
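
Continuing the illustrative sketch above: under a "four weekly partitions plus one oversize catch-all" layout, constraint exclusion answers the 60-day question mechanically. A query confined to the chunked weeks is pruned to a few small children, while anything reaching further back must also scan the entire legacy partition, which is presumably why long ranges stay slow until the old data is rechunked:

    -- Catch-all child for everything older than the chunked weeks.
    CREATE TABLE reports_old (
        CHECK (date_processed < '2009-01-05')
    ) INHERITS (reports);

    -- Pruned to two weekly children: cheap.
    SELECT count(*) FROM reports
     WHERE date_processed >= '2009-01-12' AND date_processed < '2009-01-26';

    -- Reaches past the chunked weeks, so the planner must also scan the
    -- whole oversize reports_old partition: still expensive.
    SELECT count(*) FROM reports
     WHERE date_processed >= '2008-12-01' AND date_processed < '2009-01-26';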
Something happened towards the end of last week (on Thursday and
Friday), and things are now marginally better.  But they're still
pretty bad.

Over the weekend (when the Socorro servers are presumably less busy),
I got a fairly complex search (from bug 476292 comment #0) to work
over a range of 96 hours.  But now I seem to max out at 8 hours:

http://crash-stats.mozilla.com/?do_query=1&product=Firefox&query_search=stack&query_type=exact&query=nsCocoaWindow%3A%3AShow%28int%29&date=&range_value=8&range_unit=hours

By the way, neither of these searches works (or worked) as-is.  In
both cases I searched over 1 hour, then 2 hours, then 4 hours (and so
forth).  This is cumbersome, but better than nothing.

There's still room for improvement, though :-)
(In reply to comment #47)
> Something happened towards the end of last week (on Thursday and
> Friday), and things are now marginally better.  

partitioning happened, Thurs-Sat iirc, or at least one part of it.
http://blog.mozilla.com/webdev/ -- bug 432450


> Over the weekend (when the Socorro servers are presumably less busy),
> I got a fairly complex search (from bug 476292 comment #0) to work
> over a range of 96 hours.  But now I seem to max out at 8 hours:
> 
> http://crash-stats.mozilla.com/?do_query=1&product=Firefox&query_search=stack&query_type=exact&query=nsCocoaWindow%3A%3AShow%28int%29&date=&range_value=8&range_unit=hours

8 hours worked straight off for me. Impressive. 12hr too, but 18hr fails. Still, that's a pretty hefty improvement.

However, a simple top-of-stack, 3-week query fails: http://crash-stats.mozilla.com/?do_query=1&product=Thunderbird&query_search=signature&query_type=contains&query=nsMsgLocalMailFolder%3A%3AAddMessage&date=&range_value=3&range_unit=weeks
The irony is that partitioning didn't actually happen.  The same trouble that's causing these queries to take so long and/or fail also thwarted the queries that repartitioning required.  The database was restored from backup on Friday night as it became apparent that repartitioning could not happen in a reasonable amount of time.

We will reattempt partitioning after we resolve some of the database maintenance difficulties that stymied us.
(In reply to comment #48)

> 8 hours worked straight off for me.

This is probably because my search was still cached.

> 12hr too. but 18hr fails. still that's a pretty hefty improvement.

This I can't explain, though.  I tried 12 hours a couple of times
before I gave up.  Maybe it's just random variations in the degree to
which the servers are busy.

(In reply to comment #49)

Oh, well ... a bit of the placebo effect, I suppose :-)

Thinking that repartitioning was done (at least partly done) made me
try harder to get searches to work.
Update here regarding partitioning:
http://blog.mozilla.com/webdev/2009/02/02/socorro-partitioning-rolled-back/

Next attempt will be Thursday. After that push, new data will be in separate partitions, which will allow us to repartition older data offline and restore it to the operational database. That should help us avoid extended downtimes in the future.
(Carrying over duplicate nomination from bug 422908)
Flags: blocking1.9.1?
Flags: blocking1.9.1? → blocking1.9.1-
Not aware of any current timeouts.  Marking as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago → 15 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro