Bug 744159 (Closed) - Opened 13 years ago, Closed 11 years ago

Input staging is down/returning a 500 Internal Server Error

Categories

Product: Input Graveyard
Component: Search
Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: stephend, Unassigned)


Details

http://input.allizom.org/ is down, returning a 500: [13:39:37.347] GET http://input.allizom.org/ [HTTP/1.1 500 Internal Server Error 59ms]
OperationalError: (1045, "Access denied for user 'input_user'@'10.2.10.103' (using password: YES)") Looks like a DB error related to the migration of tm-stage01-master01. Punting to mpressman, who's been taking care of these.
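The 1045 error comes straight from the MySQL client layer, so the quickest sanity check is to retry the same credentials outside Django. A minimal sketch, assuming the host/database/user values that surface later in this bug; the password is a placeholder, not the real value:

    import MySQLdb  # the driver behind django.db.backends.mysql

    try:
        conn = MySQLdb.connect(
            host='10.2.70.130',              # stage master, per settings_local below
            db='input_stage_mozilla_com',
            user='input_user',
            passwd='<password from settings_local>',  # placeholder
        )
        print('connected OK')
        conn.close()
    except MySQLdb.OperationalError as e:
        # 1045 here means the grant or password is wrong for this client host
        print('connection failed: %s' % (e,))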
Assignee: server-ops → server-ops-database
Component: Server Operations: Web Operations → Server Operations: Database
Assigning to mpressman.
Assignee: server-ops-database → mpressman
What database is this connecting to? input_stage_mozilla_com has the updated permissions for the user 'input_user'. There is also a database called input_mozilla_com; however, it doesn't have any users associated with it.
Mana says input_stage should be using the phx1 database with IPs 10.8.70.87-88.
Jake, can you look at the settings_local for Input and figure out where it's pointing? Mana is really unclear here; it says that input.allizom.org is both on mrapp-stage02 and on seamicros in phx1.
input.allizom.org runs on mrapp-stage02 currently, although I'm pretty sure there's a bug to move it to phx1, where the production gear is. Here is its DB connection info:

    'NAME': 'input_stage_mozilla_com',
    'ENGINE': 'django.db.backends.mysql',
    'HOST': '10.2.70.130',
    #'PORT': '',
    'USER': 'input_user',
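For figuring out where a given environment points without grepping settings files by hand, a short sketch (assuming Input's Django settings are importable, e.g. from python manage.py shell on the web host):

    from django.conf import settings

    # Print every configured database alias with the host/schema/user it targets.
    for alias, cfg in settings.DATABASES.items():
        print('%s -> host=%s db=%s user=%s' % (
            alias, cfg.get('HOST'), cfg.get('NAME'), cfg.get('USER')))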
Interesting, it seems like we have all the servers down there already. Well, Matt, is that enough info to move forward from comment 3?
I'll poke around and see what I can do. I'm surprised this stopped working, especially since we didn't move it. I'll post my results here shortly.
The connection info looks incorrect; the host should be 10.2.10.103 rather than 10.2.70.130.
Ah, I think I found the issue. The master works fine (verified by hand). It cannot connect to the *slave* database server.

    'slave': {
        'NAME': 'input_stage_mozilla_com',
        'ENGINE': 'django.db.backends.mysql',
        'HOST': '10.2.70.131',
        'USER': 'input_user',

    [root@mrapp-stage02 settings]# mysql -h 10.2.70.131 -u input_user -p
    Enter password:
    ERROR 1045 (28000): Access denied for user 'input_user'@'10.2.10.103' (using password: YES)

Looking at the stage DB slave server (10.22.70.40), I don't see a grant for this user/host combo. That's probably what's up.
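The grant check can be scripted the same way; a rough sketch, assuming an admin account on the stage slave (the admin user and password here are placeholders):

    import MySQLdb

    conn = MySQLdb.connect(host='10.22.70.40', user='<admin user>', passwd='<admin password>')
    cur = conn.cursor()
    cur.execute("SELECT user, host FROM mysql.user WHERE user = %s", ('input_user',))
    # A working setup needs a row whose host matches (or wildcards) the web host,
    # 10.2.10.103 in the error above.
    for user, host in cur.fetchall():
        print('%s@%s' % (user, host))
    conn.close()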
OK, input_user now has access to the slave. However, there is now this error: Sphinx threw an unknown exception:
Moving this back out of DBA-land... thanks Matt! The sphinx issue appears to be due to bug 726885. The sphinx hosts in SJC1 appear to have been shut down. I'm told a local sphinx installation was set up right on mrapp-stage02... trying that now.
Assignee: mpressman → nmaul
Component: Server Operations: Database → Server Operations: Web Operations
There is indeed a local sphinx daemon running, and this is now pointed at it. However, that page is now throwing a simple "Query timed out" error. It returns fairly quickly, and I can watch "top" while loading the page and see that the "searchd" daemon does briefly use up some CPU time, so *something* is getting done. Not sure how to troubleshoot this further. I'm wondering if maybe the timeout is just exceptionally low, and this system takes a bit longer to return. Is that a tunable somewhere we can adjust?
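One way to separate "searchd unreachable" from "searchd just slow" is to time a bare socket connection from the web host; a quick sketch, with the ports as assumptions (older Sphinx builds default to 3312, newer ones to 9312):

    import socket
    import time

    for port in (3312, 9312):
        s = socket.socket()
        s.settimeout(5)
        start = time.time()
        try:
            s.connect(('127.0.0.1', port))
            print('searchd answered on port %d in %.3fs' % (port, time.time() - start))
        except (socket.timeout, socket.error):
            print('nothing listening on port %d' % port)
        finally:
            s.close()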
When you say "fairly quickly", do you mean like 10 seconds? Or 1 second? And which timeout - the sphinx timeout? The zeus timeout, maybe?
As quickly as 1-2 seconds. "Query has timed out." is the entirety of the error message. I don't really know what type of query is timing out, or what it's trying to access. It seems more likely that this error message is simply inaccurate, and it's actually getting something more like a connection error of sorts.
Hrm, it looks like Matt gave permissions to the master and slave DB last week, so I am not confident it's a connection issue... let me know if you'd like me to help debug further (just going through last week's mail, since I was out).
The only place I see this in the codebase is in apps/search/client.py:

    try:
        results = sc.RunQueries()
    except socket.timeout:
        statsd.incr('sphinx.errors.timeout')
        raise SearchError(_("Query has timed out."))

So this appears to be a generic 'socket.timeout' error... presumably a Sphinx socket, but that's not 100% obvious to me (I don't know what all sc.RunQueries() does for sure... it might be running *and storing*, for example). I grepped around the source a bit and found this line:

    vendor/src/sphinxapi/sphinxapi/__init__.py:K_TIMEOUT = 1  # Socket timeout in seconds

I increased this manually to 5 seconds, and the page now loads properly! I'm guessing this host is just a bit slower and needs a little longer to respond properly. Kicking this over to web dev to implement this (or some other fix) in the upstream repo... my local change won't persist.
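One possible shape for that fix, sketched only (this is not the actual patch): raise the Sphinx socket timeout per client from a setting rather than hand-editing the vendored K_TIMEOUT constant. SPHINX_TIMEOUT is an assumed setting name, and the host/port are placeholders for whatever apps/search/client.py already configures.

    from django.conf import settings
    from sphinxapi import SphinxClient  # vendored under vendor/src/sphinxapi

    sc = SphinxClient()
    sc.SetServer('127.0.0.1', 3312)  # placeholder; keep whatever client.py uses today

    timeout = getattr(settings, 'SPHINX_TIMEOUT', 5)  # seconds; previously hard-coded to 1
    if hasattr(sc, 'SetConnectTimeout'):  # exposed by the stock sphinxapi client
        sc.SetConnectTimeout(timeout)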
Assignee: nmaul → nobody
Component: Server Operations: Web Operations → Search
Product: mozilla.org → Input
QA Contact: cshields → search
It's definitely a sphinx query error. There's nothing else in RunQueries() that uses a socket. Changing the code is a very broad thing that will affect multiple sites (because it's a 3rd-party library). We can do it once we check the impact, but I'm also curious what kind of box this host is. Sphinx keeps everything it can in memory, so a 1-second timeout should be more than enough.
This server is a single box, and hosts many staging sites. On top of that, it is an older system, comparatively: HP DL360 G4. It seems reasonable to me that it's just taking slightly longer to return than the old systems did. A 2-second timeout might be sufficient. It might also be that it *can't* keep everything in memory. Do we know what the dataset size for input.allizom.org is? Perhaps it's having to go to disk a little bit.
(In reply to Jake Maul [:jakem] from comment #19)
> This server is a single box, and hosts many staging sites. On top of that,
> it is an older system, comparatively: HP DL360 G4. It seems reasonable to me
> that it's just taking slightly longer to return than the old systems did. A
> 2-second timeout might be sufficient.

Is that the box Sphinx is on, or the webapp?

> It might also be that it *can't* keep everything in memory. Do we know what
> the dataset size for input.allizom.org is? Perhaps it's having to go to
> disk a little bit.

Sphinx only stores a big bucket of integers in memory; it's pretty compact.
(In reply to James Socol [:jsocol, :james] from comment #20)
> Is that the box Sphinx is on, or the webapp?

Both... it's using a local install of Sphinx.

> > It might also be that it *can't* keep everything in memory. Do we know what
> > the dataset size for input.allizom.org is? Perhaps it's having to go to
> > disk a little bit.
>
> Sphinx only stores a big bucket of integers in memory, it's pretty compact.

The system has 4GB of RAM total, and hosts lots of sites beyond just input.allizom.org. Is there a "cache size" tunable somewhere?
Please note, I was able to generate a "Query has timed out" error in *production* input.mozilla.org just now, simply by visiting the page and changing the Firefox version dropdown to "all". Upon trying again, it worked. So then I checked just en-US, and got the error again... and no amount of reloading or trying again seems sufficient to make it work. Here's the search that repeatably triggers this error: http://input.mozilla.org/en-US/?q=&product=firefox&version=--&date_start=&date_end=&locale=en-US In light of this breaking production as well as staging, can we try increasing this timeout?
This is still very easy to replicate even in prod, simply by choosing a version and a platform, or a version and a locale. https://input.mozilla.org/en-US/?q=&product=firefox&version=14.0a2&date_start=&date_end=&locale=en-US That link never works for me. It always generates a "Query has timed out" error.
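For whoever picks this up, a trivial repeatable check against that URL (a sketch only; requests is used here purely for illustration, and the error string is the one raised in apps/search/client.py):

    import requests

    URL = ('https://input.mozilla.org/en-US/?q=&product=firefox&version=14.0a2'
           '&date_start=&date_end=&locale=en-US')

    resp = requests.get(URL, timeout=30)
    if 'Query has timed out' in resp.text:
        print('still reproduces: Sphinx query timed out')
    else:
        print('search responded normally (HTTP %d)' % resp.status_code)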
I have a pull request open for changing that timeout; hopefully we can fix this once that merges in.
Should be fixed by search perf improvements in new input.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Product: Input → Input Graveyard