Closed Bug 744159 Opened 13 years ago Closed 11 years ago
Input staging is down/returning a 500 Internal Server Error
Categories: Input Graveyard :: Search, defect
Tracking: (Not tracked)
Status: RESOLVED WORKSFORME
People: (Reporter: stephend, Unassigned)
http://input.allizom.org/ is down, returning a 500:
[13:39:37.347] GET http://input.allizom.org/ [HTTP/1.1 500 Internal Server Error 59ms]
Comment 1•13 years ago
OperationalError: (1045, "Access denied for user 'input_user'@'10.2.10.103' (using password: YES)")
Looks like a DB error related to the migration of tm-stage01-master01. Punting to mpressman, who's been taking care of these.
Assignee: server-ops → server-ops-database
Component: Server Operations: Web Operations → Server Operations: Database
Comment 3•13 years ago
What database is this connecting to? input_stage_mozilla_com has the updated permissions using the user 'input_user'. There is also a database called input_mozilla_com; however, this doesn't have any users associated with it.
Comment 4•13 years ago
Mana says input_stage should be using phx1 database with IP 10.8.70.87-88.
Comment 5•13 years ago
Jake, can you look at the settings_local for Input and figure out where it's pointing? Mana is really unclear here, it says that input.allizom.org is both on mrapp-stage02 and on seamicros in phx1.
Comment 6•13 years ago
input.allizom.org runs on mrapp-stage02 currently, although I'm pretty sure there's a bug to move it to phx1, where the production gear is.
Here's its DB connection info:
    'NAME': 'input_stage_mozilla_com',
    'ENGINE': 'django.db.backends.mysql',
    'HOST': '10.2.70.130',
    #'PORT': '',
    'USER': 'input_user',
Comment 7•13 years ago
Interesting, it seems like we have all the servers down there already.
Well, Matt, is that enough info to move forward from comment 3?
Comment 8•13 years ago
I'll poke around and see what I can do. I'm surprised this stopped working, especially since we didn't move it. I'll post my results here shortly.
Comment 9•13 years ago
The connection info looks incorrect; the host should be 10.2.10.103 rather than 10.2.70.130.
Comment 10•13 years ago
Ah, I think I found the issue.
The master works fine (verified by hand). It cannot connect to the *slave* database server.
'slave': {
    'NAME': 'input_stage_mozilla_com',
    'ENGINE': 'django.db.backends.mysql',
    'HOST': '10.2.70.131',
    'USER': 'input_user',
[root@mrapp-stage02 settings]# mysql -h 10.2.70.131 -u input_user -p
Enter password:
ERROR 1045 (28000): Access denied for user 'input_user'@'10.2.10.103' (using password: YES)
Looking at the stage DB slave server (10.22.70.40), I don't see a grant for this user/host combo. That's probably what's up.
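For reference, a minimal Python sketch of the kind of GRANT the missing user/host combo would need. The privilege list and the password placeholder here are assumptions for illustration, not taken from the actual server:

```python
# Hypothetical: build the GRANT statement the slave appears to be missing.
# The privilege list and '<password>' placeholder are assumptions.
user, client_host, db = "input_user", "10.2.10.103", "input_stage_mozilla_com"

grant_sql = (
    f"GRANT SELECT, INSERT, UPDATE, DELETE ON `{db}`.* "
    f"TO '{user}'@'{client_host}' IDENTIFIED BY '<password>';"
)
print(grant_sql)
```

Note the host in the grant is the *client* address MySQL sees in the "Access denied" message, not the DB server's own address.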
Comment 11•13 years ago
OK, input_user now has access to the slave; however, there is now a new error: Sphinx threw an unknown exception:
Comment 12•13 years ago
Moving this back out of DBA-land... thanks Matt!
The sphinx issue appears to be due to bug 726885. The sphinx hosts in SJC1 appear to have been shut down. I'm told a local sphinx installation was set up right on mrapp-stage02... trying that now.
Assignee: mpressman → nmaul
Component: Server Operations: Database → Server Operations: Web Operations
Comment 13•13 years ago
There is indeed a local sphinx daemon running, and this is now pointed at it.
However, that page is now throwing a simple "Query timed out" error. It returns fairly quickly, and I can watch "top" while loading the page and see that the "searchd" daemon does briefly use up some CPU time, so *something* is getting done. Not sure how to troubleshoot this further.
I'm wondering if maybe the timeout is just exceptionally low, and this system takes a bit longer to return. Is that a tunable somewhere we can adjust?
Comment 14•13 years ago
When you say "fairly quickly" do you mean like 10 seconds? or 1 second?
And timeouts - sphinx timeout? zeus timeout maybe?
Comment 15•13 years ago
As quickly as 1-2 seconds.
"Query has timed out." is the entirety of the error message. I don't really know what type of query is timing out, or what it's trying to access. It actually seems more likely that this error message is completely incorrect, and it's actually getting something more like a connection error of sorts.
Comment 16•13 years ago
Hrm, it looks like Matt gave permissions to the master and slave DB last week, so I am not confident it's a connection issue... let me know if you'd like me to help debug further (just going through last week's mail, since I was out).
Comment 17•13 years ago
The only place I see this in the codebase is in apps/search/client.py:
try:
    results = sc.RunQueries()
except socket.timeout:
    statsd.incr('sphinx.errors.timeout')
    raise SearchError(_("Query has timed out."))
So this appears to be a generic 'socket.timeout' error... presumably a Sphinx socket, but that's not 100% obvious to me (I don't know what all sc.RunQueries() does for sure... it might be running *and storing*, for example).
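As a self-contained illustration of that error path (the statsd stand-in and the always-timing-out query function below are hypothetical, mocked so the snippet runs on its own), the low-level socket timeout becomes the user-facing message like so:

```python
import socket

class SearchError(Exception):
    """Stand-in for the app's SearchError."""

counters = {}  # stand-in for the statsd client

def incr(key):
    counters[key] = counters.get(key, 0) + 1

def run_queries(query_fn):
    # Mirrors the except clause above: a raw socket.timeout is counted
    # and re-raised as a user-facing SearchError.
    try:
        return query_fn()
    except socket.timeout:
        incr('sphinx.errors.timeout')
        raise SearchError("Query has timed out.")

def slow_query():
    # Hypothetical query that always exceeds the socket timeout.
    raise socket.timeout()

try:
    run_queries(slow_query)
except SearchError as e:
    print(e)  # prints: Query has timed out.
```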
I grepped around the source a bit and found this line:
vendor/src/sphinxapi/sphinxapi/__init__.py:K_TIMEOUT = 1 # Socket timeout in seconds
I increased this manually up to 5 seconds, and the page now loads properly! I'm guessing this host is just a bit slower and needs a little longer to respond properly.
Kicking this over to web dev to implement this (or some other fix) in the upstream repo... my local change won't persist.
Assignee: nmaul → nobody
Component: Server Operations: Web Operations → Search
Product: mozilla.org → Input
QA Contact: cshields → search
Comment 18•13 years ago
It's definitely a sphinx query error. There's nothing else in RunQueries() that uses a socket.
Changing the code is a very broad thing that will affect multiple sites (because it's a 3rd party library). We can do it once we check the impact, but I'm also curious what kind of box this host is. Sphinx keeps everything it can in memory, so a 1-second time out should be more than enough.
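One shape the upstream fix could take (a sketch only, not the actual patch; the setting name SPHINX_TIMEOUT is made up here) is to read the timeout from app-level settings instead of the vendored constant:

```python
# Hypothetical: make the Sphinx socket timeout configurable rather than
# hand-editing K_TIMEOUT = 1 in the vendored sphinxapi module.
K_TIMEOUT_DEFAULT = 1  # the vendored default, in seconds

def sphinx_timeout(settings=None):
    """Return the socket timeout, preferring an app-level override."""
    settings = settings or {}
    return settings.get('SPHINX_TIMEOUT', K_TIMEOUT_DEFAULT)

print(sphinx_timeout())                       # vendored default: 1
print(sphinx_timeout({'SPHINX_TIMEOUT': 5}))  # staging override: 5
```

This keeps other sites sharing the vendored library on the default while letting the slower staging host opt into a longer timeout.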
Comment 19•13 years ago
This server is a single box, and hosts many staging sites. On top of that, it is an older system, comparatively: HP DL360 G4. It seems reasonable to me that it's just taking slightly longer to return than the old systems did. A 2-second timeout might be sufficient.
It might also be that it *can't* keep everything in memory. Do we know what the dataset size for input.allizom.org is? Perhaps it's having to go to disk a little bit.
Comment 20•13 years ago
(In reply to Jake Maul [:jakem] from comment #19)
> This server is a single box, and hosts many staging sites. On top of that,
> it is an older system, comparatively: HP DL360 G4. It seems reasonable to me
> that it's just taking slightly longer to return than the old systems did. A
> 2-second timeout might be sufficient.
Is that the box Sphinx is on, or the webapp?
> It might also be that it *can't* keep everything in memory. Do we know what
> the dataset size for input.allizom.org is? Perhaps it's having to go to
> disk a little bit.
Sphinx only stores a big bucket of integers in memory; it's pretty compact.
Comment 21•13 years ago
(In reply to James Socol [:jsocol, :james] from comment #20)
> Is that the box Sphinx is on, or the webapp?
Both... it's using a local install of Sphinx.
> > It might also be that it *can't* keep everything in memory. Do we know what
> > the dataset sites for input.allizom.org is? Perhaps it's having to go to
> > disk a little bit.
>
> Sphinx only stores a big bucket of integers in memory, it's pretty compact.
The system has 4GB of RAM total, and hosts lots of sites beyond just input.allizom.org. Is there a "cache size" tunable somewhere?
Comment 22•13 years ago
Please note, I was able to generate a "Query has timed out" error in *production* input.mozilla.org just now, simply by visiting the page and changing the Firefox version dropdown to "all". Upon trying again, it worked.
So then I checked just en-US, and got the error again... and no amount of reloading or trying again seems sufficient to make it work.
Here's the search that repeatably triggers this error:
http://input.mozilla.org/en-US/?q=&product=firefox&version=--&date_start=&date_end=&locale=en-US
In light of this breaking production as well as staging, can we try increasing this timeout?
Comment 23•13 years ago
This is still very easy to replicate even in prod, simply by choosing a version and a platform, or a version and a locale.
https://input.mozilla.org/en-US/?q=&product=firefox&version=14.0a2&date_start=&date_end=&locale=en-US
That link never works for me. It always generates a "Query has timed out" error.
Comment 24•13 years ago
I have a pull request open for changing that timeout; hopefully we can fix this once it merges in.
Comment 25•11 years ago
Should be fixed by search perf improvements in new input.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Updated•8 years ago
Product: Input → Input Graveyard