Bug 493053 - Please fix MDC timeouts and server resets
Status: RESOLVED FIXED
Whiteboard: ETA 05/26
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: All Other
Importance: -- blocker
Target Milestone: ---
Assigned To: Dave Miller [:justdave] (justdave@bugzilla.org)
QA Contact: matthew zeier [:mrz]
Mentors:
Depends on:
Blocks:
Reported: 2009-05-14 12:34 PDT by Eric Shepherd [:sheppy]
Modified: 2015-03-12 08:17 PDT (History)
justdave: needs-downtime+
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Description Eric Shepherd [:sheppy] 2009-05-14 12:34:20 PDT
MDC has been timing out and experiencing web server resets or the like all day, and is generally misbehaving. We need to get this checked out, as it appears to be causing edits to get borked.
Comment 1 Reed Loden [:reed] (use needinfo?) 2009-05-14 12:35:22 PDT
wikimo has this same issue... I personally blame the netscaler, but that's unfounded and just my personal guess. :)
Comment 2 matthew zeier [:mrz] 2009-05-14 12:48:48 PDT
bugzilla, also behind the netscaler, isn't showing issues.  What's common between MDC and wikimo?
Comment 3 Reed Loden [:reed] (use needinfo?) 2009-05-14 12:53:02 PDT
Same database server is about the only connection besides the netscaler.
Comment 4 Jeremy Orem [:oremj] 2009-05-14 13:49:21 PDT
The common, and typically offending, factor is tm-c01-master01.  Usually the culprit is reporter or sfx.
Comment 5 Eric Shepherd [:sheppy] 2009-05-14 14:15:54 PDT
Is there any way to isolate those guys from MDC and/or wikimo? This sort of thing is happening way too often and is beating the hell out of people's productivity when it does.
Comment 6 matthew zeier [:mrz] 2009-05-14 14:19:06 PDT
Dave, where's the right db cluster for these?
Comment 7 Eric Shepherd [:sheppy] 2009-05-18 07:58:26 PDT
This problem has started happening again, by the way. Had cleared up for a couple days but it's back. This is making working with MDC very difficult.
Comment 8 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-18 11:01:05 PDT
MDC is at least a B-level service.  The right answer is to get it off C01.  B01 might be a good place for it.  It would be sharing resources with bonsai, tinderbox, and friends (and those are typically kind to the DB -- they just hose the front end webservers, which are separate). I would suspect that's not happening without an outage window, but if Sheppy wants to take a half hour hit now in order to clear it up for good we probably could.

SFX and Reporter arguably should get their own DB cluster, not for SLA reasons, but because they tend to hose everything else, and it's not fair to the others.
Comment 9 Eric Shepherd [:sheppy] 2009-05-18 11:38:30 PDT
I'd happily take a hit whenever if it would help reduce the impact of other services on MDC's performance. Today things aren't as bad as they were the other day (I'm only currently getting occasional timeouts, instead of timing out on nearly every request), but it's still pretty frustrating.
Comment 10 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-20 21:53:44 PDT
OK, we're going to move this into the B01 cluster during Thursday's outage window.
Comment 11 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-20 21:56:11 PDT
It would be helpful if sometime before then https://intranet.mozilla.org/Developer.mozilla.org could be updated to tell me where to find the database configuration on the webheads so I can change it to point at the new DB server at the appropriate time.
Comment 12 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-24 15:49:04 PDT
Moving this to Tuesday since we missed getting an outage notice sent out for Thursday.  Sheppy or oremj: I could still use an answer for comment 11 before we do this.
Comment 13 Eric Shepherd [:sheppy] 2009-05-24 16:54:55 PDT
I don't know anything about the database configuration stuff, so that'd be for oremj to do.
Comment 14 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-24 18:14:58 PDT
(In reply to comment #13)
> I don't know anything about the database configuration stuff, so that'd be for
> oremj to do.

Well, I know you've set up staging instances of it on your own boxes before, so that's why I included you in that request, since I presume the config files would probably be in the same place relative to the docroot or something.  I can probably figure out what to change once I figure out where they are.
Comment 15 Eric Shepherd [:sheppy] 2009-05-24 18:21:18 PDT
Ah, no, at home I've just used the stock Deki VM with copies of our database and attachments plastered in. I don't know where the stuff is on the real server.
Comment 16 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-24 19:04:38 PDT
Reed found the config file, the wiki page is updated.  So we're all ready to go for Tuesday night.
Comment 17 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-24 19:11:10 PDT
Estimated downtime is about 10 to 15 minutes (5 minutes to dump the DB, 5 to 10 minutes to restore it on the new db cluster).  I'd put an hour on the downtime notice just to play it safe.
Comment 18 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-26 19:18:42 PDT
took site down at 7:03pm

dump on c01:
real    1m31.475s
user    0m30.771s
sys     0m8.513s

copy to b01:
1182MB  28.1MB/s   00:42    

import on b01:
real    2m5.265s
user    0m16.384s
sys     0m2.239s

application restarted at 7:14pm

And it's dead.

And the error doesn't make any sense, because I can connect as that user to the new DB server from both deki webheads.  My only guess is there's more config to change somewhere besides that LocalSettings.php file...
Comment 19 Eric Shepherd [:sheppy] 2009-05-26 19:38:28 PDT
Not something I can help with; you'll need to talk to oremj if you're not already.
Comment 20 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-26 20:17:44 PDT
yeah, have been, and on the dead silent #mindtouch on IRC, too.

But we got it working finally... turns out the stored procs have the definer coded in them, so they run with the privileges of the user that created them.  Except said user (as user@oldip) doesn't exist on the new DB server.

The fix turned out to be:
- mysqldump --routines --no-data dbname > file
- open file, remove all the tables so it's only procs
- s/oldip/newip/ in the definers
- mysql dbname < file
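
For anyone repeating this, the s/oldip/newip/ step above can be sketched as a small shell helper. This is only a sketch under stated assumptions, not what was actually run: the function name rewrite_definer_host, the file names, and the example IPs are hypothetical, and it assumes mysqldump writes definers as DEFINER=`user`@`host` with the host being a plain IP or hostname.

```shell
# Rewrite the host part of stored-procedure definers in a dump read on stdin.
# Hypothetical helper; assumes definers appear as DEFINER=`user`@`host`.
rewrite_definer_host() {
    old_host=$1
    new_host=$2
    # Escape the dots so 10.0.0.1 does not also match e.g. 10x0x0x1.
    old_re=$(printf '%s' "$old_host" | sed 's/\./\\./g')
    # Only touch lines that carry a DEFINER clause.
    sed '/DEFINER=/ s/@`'"$old_re"'`/@`'"$new_host"'`/g'
}

# Intended use (dbname, the file names, and the IPs are placeholders):
#   mysqldump --routines --no-data dbname > routines.sql
#   (edit routines.sql so only the procs remain)
#   rewrite_definer_host 10.0.0.1 10.0.0.2 < routines.sql > routines.fixed.sql
#   mysql dbname < routines.fixed.sql
```

Restricting the substitution to lines that contain DEFINER= keeps it from touching the same address if it happens to show up in a procedure body. The new host presumably has to match an account that actually exists on the new DB server, otherwise the procs would fail the same way.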
Comment 21 Eric Shepherd [:sheppy] 2009-05-26 20:46:56 PDT
For future reference, the MindTouch guys say this issue is resolved in the 9.02.2 release we're installing Thursday. They've been moving away from the use of stored procs.
Comment 22 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-26 20:50:36 PDT
Actually, what I saw them say was they were removing the definer restrictions from the stored procs in the next version.
Comment 23 Eric Shepherd [:sheppy] 2009-05-26 20:52:04 PDT
Yeah, that's a step in a longer-term process, at least according to past conversations I've had with them.
Comment 24 Eric Shepherd [:sheppy] 2009-05-28 07:51:32 PDT
This problem is still happening (or at least, something that looks exactly the same is happening). What's up?
Comment 25 Dave Miller [:justdave] (justdave@bugzilla.org) 2009-05-28 07:56:58 PDT
(In reply to comment #24)
> This problem is still happening (or at least, something that looks exactly the
> same is happening). What's up?

Bug 495116, which is still open.
