MDC has been timing out and experiencing web server resets or the like all day, and is generally misbehaving. We need to get this checked out, as it appears to be causing edits to get borked.
wikimo has this same issue... I personally blame the netscaler, but that's unfounded and just my personal guess. :)
bugzilla, also behind the netscaler, isn't showing issues. What's common between MDC and wikimo?
Same database server is about the only connection besides the netscaler.
The common, and typically offending, factor is tm-c01-master01. Usually the culprit is reporter or sfx.
Is there any way to isolate those guys from MDC and/or wikimo? This sort of thing is happening way too often and is beating the hell out of people's productivity when it does.
Dave, where's the right db cluster for these?
This problem has started happening again, by the way. It had cleared up for a couple of days, but it's back, and it's making working with MDC very difficult.
MDC is at least a B-level service. The right answer is to get it off C01. B01 might be a good place for it. It would be sharing resources with bonsai, tinderbox, and friends (and those are typically kind to the DB -- they just hose the front end webservers, which are separate). I would suspect that's not happening without an outage window, but if Sheppy wants to take a half hour hit now in order to clear it up for good we probably could.
SFX and Reporter arguably should get their own DB cluster, not for SLA reasons, but because they tend to hose everything else, and it's not fair to the others.
I'd happily take a hit whenever if it would help reduce the impact of other services on MDC's performance. Today things aren't as bad as they were the other day (I'm only currently getting occasional timeouts, instead of timing out on nearly every request), but it's still pretty frustrating.
OK, we're going to move this into the B01 cluster during Thursday's outage window.
It would be helpful if sometime before then https://intranet.mozilla.org/Developer.mozilla.org could be updated to tell me where to find the database configuration on the webheads so I can change it to point at the new DB server at the appropriate time.
Moving this to Tuesday since we missed getting an outage notice sent out for Thursday. Sheppy or oremj: I could still use an answer for comment 11 before we do this.
I don't know anything about the database configuration stuff, so that'd be for oremj to do.
(In reply to comment #13)
> I don't know anything about the database configuration stuff, so that'd be for
> oremj to do.
Well, I know you've set up staging instances of it on your own boxes before, which is why I included you in that request. I presume the config files are in the same place relative to the docroot or something. I can probably figure out what to change once I figure out where they are.
Ah, no, at home I've just used the stock Deki VM with copies of our database and attachments plastered in. I don't know where the stuff is on the real server.
Reed found the config file, the wiki page is updated. So we're all ready to go for Tuesday night.
Estimated downtime is about 10 to 15 minutes (5 minutes to dump the DB, 5 to 10 minutes to restore it on the new db cluster). I'd put an hour on the downtime notice just to play it safe.
took site down at 7:03pm
dump on c01:
copy to b01:
1182MB 28.1MB/s 00:42
import on b01:
application restarted at 7:14pm
And it's dead.
And the error doesn't make any sense, because I can connect as that user to the new DB server from both deki webheads. My only guess is there's more config to change somewhere besides that LocalSettings.php file...
Not something I can help with; you'll need to talk to oremj if you're not already.
yeah, have been, and on the dead silent #mindtouch on IRC, too.
But we got it working finally... turns out the stored procs have the definer (user@host) coded in them, so they run with the privileges of the user that created them. Except said user doesn't exist on the new DB server, since the host part was still the old IP.
The fix turned out to be:
- mysqldump --routines --no-data dbname > file
- open file, remove all the tables so it's only procs
- s/oldip/newip/ in the definers (the host part of each DEFINER=user@host)
- mysql dbname < file
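For anyone repeating this, the search-and-replace step above could be scripted roughly as follows. This is only a sketch, not what was actually run here: it assumes the dump writes definers in the usual DEFINER=`user`@`host` form, and the function name, user name and IP addresses are made up for illustration. It only touches the host inside DEFINER clauses, so an IP that happens to appear elsewhere in a proc body is left alone.

```python
import re

# Matches a DEFINER clause in the backtick-quoted form mysqldump normally
# emits: DEFINER=`user`@`host` (assumption: no unusual quoting in the dump).
DEFINER_RE = re.compile(r"DEFINER\s*=\s*`([^`]*)`@`([^`]*)`")


def rewrite_definers(sql, old_host, new_host):
    """Return sql with old_host replaced by new_host in DEFINER clauses only."""

    def swap(match):
        user, host = match.group(1), match.group(2)
        if host == old_host:
            host = new_host
        return "DEFINER=`{}`@`{}`".format(user, host)

    return DEFINER_RE.sub(swap, sql)


# Hypothetical example: user name and IPs are placeholders, not the real ones.
sample = "CREATE DEFINER=`wikiuser`@`10.0.0.1` PROCEDURE p() SELECT 1;"
print(rewrite_definers(sample, "10.0.0.1", "10.0.0.2"))
```

The rewritten text would then be loaded with the same mysql dbname < file step as before.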
For future reference, the MindTouch guys say this issue is resolved in the 9.02.2 release we're installing Thursday. They've been moving away from the use of stored procs.
Actually, what I saw them say was they were removing the definer restrictions from the stored procs in the next version.
Yeah, that's a step in a longer-term process, at least according to past conversations I've had with them.
This problem is still happening (or at least, something that looks exactly the same is happening). What's up?
(In reply to comment #24)
> This problem is still happening (or at least, something that looks exactly the
> same is happening). What's up?
Bug 495116, which is still open.