
Please fix MDC timeouts and server resets

RESOLVED FIXED

Status

Product: mozilla.org Graveyard
Component: Server Operations
Priority: --
Severity: blocker
Status: RESOLVED FIXED
Reported: 8 years ago
Last modified: 2 years ago

People

(Reporter: sheppy, Assigned: justdave)

Tracking

Bug Flags:
needs-downtime +

Details

(Whiteboard: ETA 05/26)

(Reporter)

Description

8 years ago
MDC has been timing out and experiencing web server resets or the like all day, and is generally misbehaving. We need to get this checked out, as it appears to be causing edits to get borked.
wikimo has this same issue... I personally blame the netscaler, but that's unfounded and just my personal guess. :)

Comment 2

8 years ago
bugzilla, also behind the netscaler, isn't showing issues.  What's common between MDC and wikimo?
Component: Server Operations: Web Content Push → Server Operations
Summary: Pleae fix MDC timeouts and server resets → Please fix MDC timeouts and server resets
Same database server is about the only connection besides the netscaler.

Comment 4

8 years ago
The common, and typically offending, factor is tm-c01-master01.  Usually the culprit is reporter or sfx.
(Reporter)

Comment 5

8 years ago
Is there any way to isolate those guys from MDC and/or wikimo? This sort of thing is happening way too often and is beating the hell out of people's productivity when it does.

Comment 6

8 years ago
Dave, where's the right db cluster for these?
Assignee: server-ops → justdave
(Reporter)

Comment 7

8 years ago
This problem has started happening again, by the way. Had cleared up for a couple days but it's back. This is making working with MDC very difficult.
Severity: critical → blocker
MDC is at least a B-level service.  The right answer is to get it off C01.  B01 might be a good place for it.  It would be sharing resources with bonsai, tinderbox, and friends (and those are typically kind to the DB -- they just hose the front end webservers, which are separate). I would suspect that's not happening without an outage window, but if Sheppy wants to take a half hour hit now in order to clear it up for good we probably could.

SFX and Reporter arguably should get their own DB cluster, not for SLA reasons, but because they tend to hose everything else, and it's not fair to the others.
(Reporter)

Comment 9

8 years ago
I'd happily take a hit whenever if it would help reduce the impact of other services on MDC's performance. Today things aren't as bad as they were the other day (I'm only currently getting occasional timeouts, instead of timing out on nearly every request), but it's still pretty frustrating.

Updated

8 years ago
Whiteboard: ETA 05/26

Updated

8 years ago
Whiteboard: ETA 05/26 → ETA 05/21
OK, we're going to move this into the B01 cluster during Thursday's outage window.
Flags: needs-downtime+
It would be helpful if sometime before then https://intranet.mozilla.org/Developer.mozilla.org could be updated to tell me where to find the database configuration on the webheads so I can change it to point at the new DB server at the appropriate time.
Moving this to Tuesday since we missed getting an outage notice sent out for Thursday.  Sheppy or oremj: I could still use an answer for comment 11 before we do this.
Whiteboard: ETA 05/21 → ETA 05/26
(Reporter)

Comment 13

8 years ago
I don't know anything about the database configuration stuff, so that'd be for oremj to do.
(In reply to comment #13)
> I don't know anything about the database configuration stuff, so that'd be for
> oremj to do.

Well, I know you've set up staging instances of it on your own boxes before, so that's why I included you in that request, since I presume the config files would probably be in the same place relative to the docroot or something.  I can probably figure out what to change once I figure out where they are.
(Reporter)

Comment 15

8 years ago
Ah, no, at home I've just used the stock Deki VM with copies of our database and attachments plastered in. I don't know where the stuff is on the real server.
Reed found the config file, and the wiki page has been updated. So we're all ready to go for Tuesday night.
Estimated downtime is about 10 to 15 minutes (5 minutes to dump the DB, 5 to 10 minutes to restore it on the new db cluster).  I'd put an hour on the downtime notice just to play it safe.
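For the record, a minimal sketch of the kind of commands involved in a move like this; the database name, the b01 hostname, and the credential handling here are assumptions, not the exact commands that were run:

# dump on the current (c01) master; "dekiwiki" as the database name is a guess
time mysqldump --single-transaction dekiwiki > dekiwiki.sql
# copy the dump over to the new (b01) cluster; hostname is assumed
time scp dekiwiki.sql tm-b01-master01:/tmp/dekiwiki.sql
# import on the b01 master
time mysql dekiwiki < /tmp/dekiwiki.sql
# then repoint the webheads' database config at the new host and restart the app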
took site down at 7:03pm

dump on c01:
real    1m31.475s
user    0m30.771s
sys     0m8.513s

copy to b01:
1182MB  28.1MB/s   00:42    

import on b01:
real    2m5.265s
user    0m16.384s
sys     0m2.239s

application restarted at 7:14pm

And it's dead.

And the error doesn't make any sense, because I can connect as that user to the new DB server from both deki webheads.  My only guess is there's more config to change somewhere besides that LocalSettings.php file...
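A minimal sketch of the checks this describes, assuming the Deki install lives under a docroot like /var/www/dekiwiki; the host, user, and path names are placeholders:

# the connectivity test that already succeeds from both webheads
mysql -h NEW_DB_HOST -u DEKI_DB_USER -p -e 'SELECT 1'
# hunt for any other file under the install that still names the old DB server
grep -rln 'OLD_DB_HOST' /var/www/dekiwiki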
(Reporter)

Comment 19

8 years ago
Not something I can help with; you'll need to talk to oremj if you're not already.
yeah, have been, and on the dead silent #mindtouch on IRC, too.

But we got it working finally... turns out the stored procs have the definer's ACL coded into them, so they run as the user that created them. Except said user doesn't exist.

The fix turned out to be:
- mysqldump --routines --no-data dbname > file
- open file, remove all the tables so it's only procs
- s/oldip/newip/ in the definers
- mysql dbname < file
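Spelled out as a hedged sketch of the steps above (dbname, oldip, and newip are placeholders for the real values, and the table-stripping step was done by hand in an editor):

# dump only the stored routines, no table data
mysqldump --routines --no-data dbname > procs.sql
# edit procs.sql to delete the empty CREATE TABLE blocks so only the routines remain
# rewrite the DEFINER host from the old address to the new one
sed -i 's/oldip/newip/g' procs.sql
# reload the edited routines
mysql dbname < procs.sql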
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
(Reporter)

Comment 21

8 years ago
For future reference, the MindTouch guys say this issue is resolved in the 9.02.2 release we're installing Thursday. They've been moving away from the use of stored procs.
Actually, what I saw them say was they were removing the definer restrictions from the stored procs in the next version.
(Reporter)

Comment 23

8 years ago
Yeah, that's a step in a longer-term process, at least according to past conversations I've had with them.
(Reporter)

Comment 24

8 years ago
This problem is still happening (or at least, something that looks exactly the same is happening). What's up?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #24)
> This problem is still happening (or at least, something that looks exactly the
> same is happening). What's up?

Bug 495116, which is still open.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard