Bug 444749: Complex or long searches are timing out
Status: Closed (RESOLVED FIXED)
Opened 16 years ago · Closed 15 years ago
Categories: Socorro :: General, task, P1
Target Milestone: 0.6
People: Reporter: smichaud · Assigned: morgamic
Tracking: (Not tracked)
For the last day or so, every search I've performed at
http://crash-stats.mozilla.com/ has returned "Error 500. Internal
Server Error".
This includes even the simplest searches -- like for topcrashers.
Updated by reporter • 16 years ago
OS: Linux → All
Hardware: PC → All
Comment 1 • 16 years ago
The database is pretty heavily loaded and we have a few ideas on reducing the load. However, it will probably be next week by the time we put them into production.
Assignee: server-ops → aravind
Comment 2 (reporter) • 16 years ago
The socorro servers have been badly overloaded for months. But it's
only in the last couple of days that (as far as I can tell) they've
become completely non-functional.
Is something down?
Has the server setup been changed recently?
Comment 3 • 16 years ago
I wouldn't say they are completely non-functional. It's the aggregate reporting that's broken currently. Viewing individual reports works just fine.
We have been revamping the back-end recently, and some of the database tables have gotten out of control. That's the reason for the recent problems with some of the reporting.
Comment 4 (reporter) • 16 years ago
(In reply to comment #3)
OK, I stand corrected. I just redid a search by bug ID that worked a
couple of days ago ... and it still works.
Probably the largest volume of searches is by bug ID (from
about:crashes), so it's good they're working. But it'd still be nice
to get the other searches working again, even if only sub-optimally.
Good luck! I'm glad I'm not the one who has to fix this :-)
Summary: Socorro servers returning Error 500 on every search → Socorro servers returning Error 500 on every aggregate search
Updated • 16 years ago
QA Contact: justin → mrz
Updated • 16 years ago
Assignee: morgamic → aravind
Comment 8 • 16 years ago
(In reply to comment #6)
> Waiting on code fixes from morgamic.
>
This comment was almost a month ago, any updates/other bugs to watch?
Comment 9 • 16 years ago
We have code updates, aggregate tables and such. Waiting on me to push this stuff out.
Comment 10 • 16 years ago
hi, aravind. any ETA on your push?
thanks!
Comment 11 • 16 years ago
This bug is very odd and has now existed for a month and a half. There are a lot of crash reports and no way to search the database. If we really want to decrease the number of crashers on trunk, we should fix it ASAP. No crash analysis is possible at the moment: nearly everything returns error 500, even the search for top crashers.
Any chance to push all the stuff out within the next days?
Comment 12 • 16 years ago
We are working on it and the code is currently in our staging environment. Pushing this to production will involve some downtime and stuff. We will probably be pushing this out later this week or next week if it doesn't make it this week.
I understand the frustration that there are no aggregate reports available, but there are other changes to the app (like crash report throttling, etc) that take priority.
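Comment 12 mentions crash report throttling as the higher-priority change. The bug doesn't describe how Socorro's throttling actually works, but here is a minimal sketch of the general technique: deterministic sampling keyed on the crash ID. The function name and the 15% default rate are illustrative assumptions, not Socorro's actual values.

```python
import hashlib

def should_store(crash_id: str, sample_percent: int = 15) -> bool:
    """Keep roughly `sample_percent`% of incoming crash reports.

    Hashing the crash ID (instead of calling random()) makes the
    decision deterministic: retries of the same report always get
    the same throttling decision.
    """
    digest = hashlib.md5(crash_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # place the report in one of 100 buckets
    return bucket < sample_percent
```

Storing only a sample keeps the write load (and table growth) down, while aggregate counts can still be scaled back up by the sampling rate.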
Comment 13 (assignee) • 16 years ago
The Pylons version was replaced with a PHP version written by Les Orchard (webdev). Most of the aggregate queries work properly now. There are some known bugs with the new version, and we're working on those as well -- but please file a bug if you see something. Resolving for now.
Known:
1) missing bonsai links
2) topcrashers pages returning strange results (issue with topcrasher cron)
3) larger queries still take a long time (need to move query page over to topcrashers summary table)
Updated by assignee • 16 years ago
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 14 • 16 years ago
Cool. It looks like source line info is in the raw dump tab, for those looking for a workaround for problem #1 above.
Comment 15 • 16 years ago
morgamic: I'm still seeing 500 errors whenever a report selects a sufficient amount of data. Should I file a new bug? This one is in mozilla.org:server operations, but should I file the new one in webtools:socorro?
Comment 16 • 16 years ago
I also am unable to view any crash reports. From the topcrashers dashboard, clicking on the link to any crash report will just spin and timeout. Should this bug be reopened?
Comment 17 • 16 years ago
Here's a more specific example:
1) Go to http://crash-stats.mozilla.com/?do_query=1&product=Firefox&version=Firefox%3A3.0.2pre&query_search=signature&query_type=contains&query=&date=&range_value=2&range_unit=weeks
2) Click on a report (e.g. MultiByteToWideChar)
3) The site times out, and the page shows: Connection Interrupted
"The connection to the server was reset while the page was loading.
The network link was interrupted while negotiating a connection. Please try again."
4) Open the error console:
Error: parsers is undefined
Source File: http://crash-stats.mozilla.com/js/jquery/plugins/ui/jquery.tablesorter.min.js
Line: 2
Updated by assignee • 16 years ago
Assignee: aravind → nobody
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
QA Contact: mrz → socorro
Summary: Socorro servers returning Error 500 on every aggregate search → Socorro queries are timing out
Updated by assignee • 16 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 18 (assignee) • 16 years ago
The report list is of concern - that shouldn't be timing out. Reopening this to address that issue. Will file another bug for the topcrasher byversion and bybranch pages separately.
Comment 19 • 16 years ago
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b1pre) Gecko/20080908024232 SeaMonkey/2.0a1pre
With the query link in comment #17 (just clicking the link above) I get:
Connection Interrupted
The document contains no data.
The network link was interrupted while negotiating a connection. Please try again.
[Try again]
Clicking "Try again" brings the same error page up again (after the timeout). It isn't a modem disconnection since I can post to Bugzilla. Clearing the error console then trying again leaves the error console empty.
OTOH I could reach crash-stats.mozilla.org tonight for an _individual_ crash report, bp-ae5e6623-7df7-11dd-abe9-001a4bd43e5c .
Clicking the link "Find a report" at top then querying for SeaMonkey crashes filed within one day (there's at least mine) http://crash-stats.mozilla.com/?do_query=1&product=SeaMonkey&query_search=signature&query_type=contains&query=&date=&range_value=1&range_unit=days brings up 5 signatures including one (23 reports) on Windows, three (1 report each) on Mac, and mine on Linux.
However, "Top Crashers" for SeaMonkey 2.0a1pre brings up 3 signatures, all with the same build ID, and all with zero crashes in all 4 columns (Total, Win, Lin, Mac). This looks suspect to me but I suppose it would be a different bug.
After all that, I see a number of irrelevant warnings and errors in the Error Console, and 4 warnings (no errors) pertaining to crash-stats.mozilla.org:
Warning: Error in parsing value for 'filter'. Declaration dropped.
Source File: http://crash-stats.mozilla.com/css/flora/flora.slider.css
Line: 4
Warning: Error in parsing value for 'filter'. Declaration dropped.
Source File: http://crash-stats.mozilla.com/css/flora/flora.tabs.css
Line: 82
Warning: Expected ',' or '{' but found 'html'. Ruleset ignored due to bad selector.
Source File: http://crash-stats.mozilla.com/css/flora/flora.datepicker.css
Line: 37
Warning: Error in parsing value for 'filter'. Declaration dropped.
Source File: http://crash-stats.mozilla.com/css/flora/flora.datepicker.css
Line: 174
Updated by assignee • 16 years ago
Target Milestone: --- → 0.6
Comment 21 • 16 years ago
morgamic: what is the bug number for the top crasher reports issues?
Comment 22 (assignee) • 16 years ago
Getting that filed shortly, I'll update this bug.
Comment 23 • 16 years ago
Comment 24 • 16 years ago
These are not only searches for top crasher reports but also for normal stack frames. I never get a result when starting a search for a specific frame.
Status: REOPENED → NEW
Updated • 16 years ago
Summary: Socorro queries are timing out → Socorro queries are timing out (Connection Interrupted)
Comment 25 • 16 years ago
Searching for a stack frame is still not working.
Severity: major → blocker
Comment 26 (assignee) • 16 years ago
PHP + Memcache + ProxyCache + Cluster should mean we're in good shape. Reopen with URLs if you are still experiencing timeouts.
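The memcached layer mentioned here is what produces the cache-warming effect discussed in the following comments: the first run of a query pays the full database cost, and repeats are served from cache until the entry expires. A minimal sketch of that read-through caching pattern follows; the dict-backed store and 5-minute TTL are illustrative assumptions, not Socorro's configuration.

```python
import time

_cache = {}        # key -> (stored_at, result); stand-in for memcached
TTL_SECONDS = 300  # assumed TTL: serve a cached result for up to 5 minutes

def cached_query(key, run_query, now=time.time):
    """Read-through cache: return a fresh cached result for `key`,
    otherwise run the expensive query and cache its result."""
    hit = _cache.get(key)
    if hit is not None and now() - hit[0] < TTL_SECONDS:
        return hit[1]    # cache hit: skip the database entirely
    result = run_query() # cache miss: pay the full query cost once
    _cache[key] = (now(), result)
    return result
```

This also explains why a query can "work for me" right after someone else ran it: the second user hits a warm cache rather than a faster database.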
Status: NEW → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Comment 27 • 16 years ago
I got a timeout on my first search.
http://crash-stats.mozilla.com/?do_query=1&product=Firefox&branch=1.9.1&platform=linux&query_search=signature&query_type=contains&query=flash&date=&range_value=1&range_unit=months
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 28 (assignee) • 16 years ago
I'd narrow it down to a shorter time frame or just Firefox release versions. The link worked for me.
Comment 29 (assignee) • 16 years ago
Don't know what to tell you -- I'm going to make fixing search response times for queries like this a high priority.
Comment 30 • 16 years ago
It probably worked for you because I already heated up the caches, right?
Comment 31 (assignee) • 16 years ago
Yea, probably. We'll keep this open, but it's going to be low priority because queries for an entire branch over an entire month for a string match are fairly uncommon. We're looking at Q1 or Q2 next year unless it blocks something big.
Summary: Socorro queries are timing out (Connection Interrupted) → Complex or long searches are timing out
Comment 32 • 16 years ago
As morphed, this probably no longer blocks bug 454640 or bug 454872.
Comment 33 • 16 years ago
It does block me from working on bug 454872.
It still times out when searching for stack frames.
Comment 34 (assignee) • 16 years ago
Mats - we'll spend time on bug 465360 which should fix the stack frame search perf issues.
Updated by assignee • 16 years ago
Assignee: nobody → morgamic
Status: REOPENED → ASSIGNED
Comment 35 (assignee) • 16 years ago
Also note that the partitioning bug 432450 will increase performance and prevent costly scans.
Comment 37 • 16 years ago
morgamic: It seems that now even non-complex queries are timing out. There are a couple of other bugs on file about this now: bug 468405 and bug 474037. I think these are all the same, but you'd know better... We really need Socorro queries working. :-/
Comment 38 (assignee) • 16 years ago
I hear you. First thing I'd like to do is speed up deploying the partitioning work. That is going to happen in the next week or so, and if we can get it done sooner than later it'd alleviate most of these issues.
We have other options -- we will have to discuss these first, though. We will have an update by Wednesday morning so we can get un-stuck.
Comment 39 • 16 years ago
Can we consider both a short-term and a long-term solution?
The short-term ugly solution is to just copy the whole setup over to a mirror so we can query history (January 20 and earlier). Basically, I don't usually need the very latest up-to-the-minute crashes, and I'm willing to go to a different server for those queries.
Long term we can circle back to fixing the whole system so we don't need that bifurcation. But, it seems to me that enabling crash-fixing work is more important than making everyone wait for the elegant solution.
Of course, I probably am spewing rubbish since I don't know how this is set up. Just saying an ugly solution would work fine for most urgent work right now.
Comment 40 • 16 years ago
rather bad again today
Comment 41 (reporter) • 16 years ago
For some time (weeks? months?) I've only been able to get results in the last hour.
Comment 42 • 16 years ago
This is blocking bug 468727, which itself is blocking the release. It's a very common crash for any HP laptop user with a fingerprint reader. We can't get help from the vendor, so we must find out what the regression range is.
I haven't yet heard why the entire database can't just be copied to a mirror that doesn't take in new crashes. We need to be able to do some crash archeology.
Comment 43 • 16 years ago
This should be fixed as part of database partitioning that will start tomorrow night and go (potentially) through Friday. Try on Monday next week and see if you still have issues.
See also: http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/
Comment 44 • 16 years ago
(In reply to comment #42)
> I haven't yet heard why the entire database can't just be copied to a mirror
> that doesn't take in new crashes. We need to be able to do some crash
> archeology.
The database is huge, and the people that would be needed to make something like that happen are actively working on the partitioning fix mentioned above, which should be the real fix here.
Comment 45 • 16 years ago
Also, it's not clear to me that making a readonly copy of the db would make queries any better. The problem is simply that the database grows too large, and postgres can't cope. The new partitioning scheme should ensure that no partition gets too large, such that the query planner can execute efficient queries no matter how many total crash reports we accumulate.
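The point above, that bounded partitions let the query planner stay efficient no matter how many crash reports accumulate, can be made concrete. PostgreSQL partitioning in this era meant child tables inheriting from a parent, each carrying a CHECK constraint so that constraint exclusion can prune partitions outside a query's date range. The table names and weekly granularity below are illustrative assumptions, not Socorro's actual schema.

```python
from datetime import date, timedelta

def week_partition(report_date):
    """Map a crash report's date to its weekly partition:
    (table name, inclusive start, exclusive end)."""
    start = report_date - timedelta(days=report_date.weekday())  # back up to Monday
    return (f"reports_{start:%Y%m%d}", start, start + timedelta(days=7))

def partition_ddl(report_date):
    """DDL for one weekly child table, PostgreSQL 8.x style: the CHECK
    constraint is what lets constraint exclusion skip this partition
    entirely for date-bounded queries."""
    name, start, end = week_partition(report_date)
    return (
        f"CREATE TABLE {name} (\n"
        f"  CHECK (date_processed >= DATE '{start}' AND"
        f" date_processed < DATE '{end}')\n"
        f") INHERITS (reports);"
    )
```

Inserts are routed to the current week's table, and because each partition is bounded, older ones can be rebuilt or repartitioned offline, which matches the plan described later in this bug.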
Comment 46 • 16 years ago
All good news.
Question: given "we’re going to chunk only the most recent four weeks of data and leave the rest as a single oversize partition" (which I think implies 5 partitions), is it believed that a reasonably scoped query will be able to search as much as 60 days, or even the 120 days of history mentioned in http://blog.mozilla.com/webdev/2009/01/20/socorro-database-partitioning-is-coming/
I was able to get a range for Bug 468727 using "occurring before". The search required a >60-day range from today.
Comment 47 (reporter) • 16 years ago
Something happened towards the end of last week (on Thursday and
Friday), and things are now marginally better. But they're still
pretty bad.
Over the weekend (when the Socorro servers are presumably less busy),
I got a fairly complex search (from bug 476292 comment #0) to work
over a range of 96 hours. But now I seem to max out at 8 hours:
http://crash-stats.mozilla.com/?do_query=1&product=Firefox&query_search=stack&query_type=exact&query=nsCocoaWindow%3A%3AShow%28int%29&date=&range_value=8&range_unit=hours
By the way, neither of these searches works (or worked) as-is. In
both cases I searched over 1 hour, then 2 hours, then 4 hours (and so
forth). This is cumbersome, but better than nothing.
There's still room for improvement, though :-)
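The manual workaround described above, doubling the search window until a query finally fails, can be automated. A sketch under stated assumptions: `run_query` is a hypothetical stand-in for one Socorro search over the last N hours, assumed to raise TimeoutError when the server gives up.

```python
def widening_search(run_query, max_hours=96):
    """Try 1 hour, then 2, 4, 8, ... up to `max_hours`, and return
    (hours, result) for the widest window that succeeded, or None if
    even the 1-hour query timed out."""
    best = None
    hours = 1
    while hours <= max_hours:
        try:
            best = (hours, run_query(hours))
        except TimeoutError:
            break  # a wider window will only be more expensive
        hours *= 2
    return best
```

Each attempt repeats the whole (wider) range, so this trades extra cheap queries for a decent chance of getting the largest answerable window without babysitting the search form.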
Comment 48 • 16 years ago
(In reply to comment #47)
> Something happened towards the end of last week (on Thursday and
> Friday), and things are now marginally better.
partitioning happened, Thurs-Sat iirc, or at least one part of it.
http://blog.mozilla.com/webdev/ -- bug 432450
> Over the weekend (when the Socorro servers are presumably less busy),
> I got a fairly complex search (from bug 476292 comment #0) to work
> over a range of 96 hours. But now I seem to max out at 8 hours:
>
> http://crash-stats.mozilla.com/?do_query=1&product=Firefox&query_search=stack&query_type=exact&query=nsCocoaWindow%3A%3AShow%28int%29&date=&range_value=8&range_unit=hours
8 hours worked straight off for me. Impressive. 12hr too, but 18hr fails. Still, that's a pretty hefty improvement.
However, a simple top-of-stack, 3-week query fails: http://crash-stats.mozilla.com/?do_query=1&product=Thunderbird&query_search=signature&query_type=contains&query=nsMsgLocalMailFolder%3A%3AAddMessage&date=&range_value=3&range_unit=weeks
Comment 49 • 16 years ago
The irony is that partitioning didn't actually happen. The same trouble that's causing these queries to take so long and/or fail also thwarted the queries that repartitioning required. The database was restored from backup on Friday night as it became apparent that repartitioning could not happen in a reasonable amount of time.
We will reattempt partitioning after we resolve some of the database maintenance difficulties that stymied us.
Comment 50 (reporter) • 16 years ago
(In reply to comment #48)
> 8 hours worked straight off for me.
This is probably because my search was still cached.
> 12hr too. but 18hr fails. still that's a pretty hefty improvement.
This I can't explain, though. I tried 12 hours a couple of times
before I gave up. Maybe it's just random variations in the degree to
which the servers are busy.
(In reply to comment #49)
Oh, well ... a bit of the placebo effect, I suppose :-)
Thinking that repartitioning was done (at least partly done) made me
try harder to get searches to work.
Comment 51 (assignee) • 16 years ago
Update here regarding partitioning:
http://blog.mozilla.com/webdev/2009/02/02/socorro-partitioning-rolled-back/
Next attempt will be Thursday. After that push, new data will be in separate partitions, which gives us the ability to repartition older data offline and restore it to the operational database. That should help us avoid extended downtimes in the future.
Updated • 16 years ago
Flags: blocking1.9.1? → blocking1.9.1-
Comment 53 (assignee) • 15 years ago
Not aware of any current timeouts. Marking as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago → 15 years ago
Resolution: --- → FIXED
Updated • 13 years ago
Component: Socorro → General
Product: Webtools → Socorro