Closed Bug 571497 Opened 9 years ago Closed 9 years ago

Rebuild SUMO database cluster.

Categories

(mozilla.org Graveyard :: Server Operations, task, blocker)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jsocol, Assigned: tellis)

Details

Attachments

(1 file)

We need to figure out why the SUMO sessions table decided to grow so quickly, and come up with a plan to return disk space to the filesystem.

Tim, I'm not sure we ever really found out what was up with the session garbage collection query we were hunting for. Is it possible it got turned off?

Shyam, is it possible PHP's session.auto_start[1] is turned on on some of the generic boxes?

[1] http://www.php.net/manual/en/session.configuration.php#ini.session.auto-start
(In reply to comment #0)
> Shyam, is it possible PHP's session.auto_start[1] is turned on on some of the
> generic boxes?

[root@mradm02 wiki.mozilla.org]# /data/bin/issue-multi-command.py generic 'grep session.auto_start /etc/php.ini'
===pm-app-generic03===
session.auto_start = 0

===pm-app-generic04===
session.auto_start = 0

===pm-app-generic01===
session.auto_start = 0

===pm-app-generic06===
session.auto_start = 0

===pm-app-generic05===
session.auto_start = 0

===pm-app-generic02===
session.auto_start = 0
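
A check like the one above can be automated. Here is a minimal Python sketch that parses output in the `===host===` format shown and lists any host where the setting is not 0. The function names are made up for illustration, and the format is assumed from the paste above rather than from any documentation of issue-multi-command.py. Note that it only reflects what is in /etc/php.ini; other ini files or per-directory overrides are not covered.

```python
from typing import Dict, List


def parse_multi_output(text: str) -> Dict[str, str]:
    """Map each host to the session.auto_start value it reported."""
    results: Dict[str, str] = {}
    host = None
    for raw in text.splitlines():
        line = raw.strip()
        # Host banners look like ===pm-app-generic03=== in the paste above.
        if line.startswith("===") and line.endswith("===") and len(line) > 6:
            host = line.strip("=")
        elif host and line.startswith("session.auto_start"):
            _, _, value = line.partition("=")
            results[host] = value.strip()
    return results


def hosts_with_auto_start(text: str) -> List[str]:
    """Return the hosts whose reported value is anything other than 0."""
    return sorted(h for h, v in parse_multi_output(text).items() if v != "0")
```

An empty list from `hosts_with_auto_start` matches the result above: no generic box has auto-start enabled in php.ini.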
Assignee: server-ops → shyam
Passing to Tim; this has more to do with the DB than anything else.
Assignee: shyam → tellis
Tim, I've noticed that replication lag (in munin) has been a lot less stable since sometime Tuesday morning. I wonder if that's related?
The cron job is still there (just checked it yesterday, actually). I'll ensure it's still running.
I cannot manage to delete any sessions without getting a lock wait timeout. I am going to try truncating the sessions table to see if that will get it back on track.
So the sessions table filled up. We don't have innodb_file_per_table enabled on the cluster, so all InnoDB tables live in the shared tablespace, which doesn't shrink when rows are deleted. The sessions table can't be emptied via normal means, because something is wrong with it.
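
For checking whether a given server config turns the option on, here is a rough Python sketch that scans my.cnf-style text. It is an illustration only: the function name is hypothetical, it looks only at the [mysqld] section, and it ignores include directives, so the running server (SHOW VARIABLES) remains the authoritative answer.

```python
def file_per_table_enabled(cnf_text: str) -> bool:
    """Return True if the [mysqld] section enables innodb_file_per_table."""
    in_mysqld = False
    enabled = False
    for raw in cnf_text.splitlines():
        # Drop trailing '#' comments, then skip blanks and ';' comment lines.
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith(";"):
            continue
        if line.startswith("["):
            in_mysqld = line.lower() == "[mysqld]"
            continue
        if not in_mysqld:
            continue
        key, _, value = line.partition("=")
        # Option files accept dashes or underscores in option names.
        if key.strip().replace("-", "_").lower() == "innodb_file_per_table":
            # A bare option name with no value counts as enabled.
            enabled = value.strip().lower() in ("", "1", "on", "true")
    return enabled
```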

Several things point to this cluster simply needing a rebuild. That will take a number of hours, so it should happen in the next outage window.

Here's the general plan:

(1) Truncate the sessions table.
(2) mysqldump the whole support_mozilla_com database on master.
(3) Copy the mysqldump to all slaves.
(4) Drop the database on master and all slaves.
(5) Recreate the database on master and all slaves.
(6) Do all this in Phoenix.

I estimate the total time at about 3 hours.
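
For anyone wanting to review the order of operations before the window, the plan above can be sketched as a dry-run generator in Python. This is only an outline under assumptions: the host names and dump path are placeholders, the table is assumed to be literally named `sessions`, loading the dump is assumed to be part of step (5), and replication handling is left out because it isn't described here. Nothing is executed; the function just returns (host, command) pairs.

```python
from typing import List, Tuple

DB = "support_mozilla_com"  # database name from the plan above


def rebuild_plan(master: str, slaves: List[str], dump_path: str) -> List[Tuple[str, str]]:
    """Return (host, shell command) pairs in the order of the plan; runs nothing."""
    hosts = [master] + list(slaves)
    steps: List[Tuple[str, str]] = []
    # (1) Truncate the sessions table (table name is an assumption).
    steps.append((master, f'mysql -e "TRUNCATE TABLE sessions" {DB}'))
    # (2) Dump the whole database on the master.
    steps.append((master, f"mysqldump --result-file={dump_path} {DB}"))
    # (3) Copy the dump to every slave.
    for slave in slaves:
        steps.append((master, f"scp {dump_path} {slave}:{dump_path}"))
    # (4) Drop the database on the master and all slaves.
    for host in hosts:
        steps.append((host, f'mysql -e "DROP DATABASE {DB}"'))
    # (5) Recreate the database everywhere and load the dump (assumed).
    for host in hosts:
        steps.append((host, f'mysql -e "CREATE DATABASE {DB}"'))
        steps.append((host, f"mysql {DB} < {dump_path}"))
    return steps
```

Printing the pairs for the real master and slave names gives a checklist that can be walked through by hand during the outage.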
Severity: critical → blocker
Summary: Investigate SUMO session spike → Rebuild SUMO database cluster.
Bouncing mysqld cleared up the horked lock problem on sessions, so we can delete them. That doesn't fix the InnoDB space-taken problem, though, so we still need this outage. The rebuild will also fix the schemas being out of sync between master and slaves.
Is this outage still planned for Tuesday night? If so, is it in the downtime notice for Tuesday?
Best time for this outage in terms of minimal traffic: ~7 PM Pacific.
Outage over. I'll check up in the morn to see if things are progressing well.
Aftermath: database sizes much reduced. Failed to get innodb_file_per_table done in the outage window. Sessions no longer piling up (they were at 9M before the outage, and about 7k now, and I see them growing/shrinking as the sessions delete script does its job).

It seems this is a win. I'm resolving fixed.
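
For reference, a session cleanup job of the kind mentioned above can delete in small batches so that no single statement holds locks for long. The Python sketch below is not the actual SUMO script: the `expiry` column is a guess at the schema, and it runs against SQLite purely so it is self-contained. On MySQL the equivalent statement would be a `DELETE ... WHERE expiry < ? LIMIT n` issued in a loop.

```python
import sqlite3


def purge_expired(conn: sqlite3.Connection, now: int, batch_size: int = 1000) -> int:
    """Delete expired sessions in batches; return the number of rows removed."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM sessions WHERE rowid IN ("
            "SELECT rowid FROM sessions WHERE expiry < ? LIMIT ?)",
            (now, batch_size),
        )
        # Commit after every batch so each transaction stays short.
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```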
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to comment #12)
> It seems this is a win. I'm resolving fixed.

Great, thanks Tim! Does this also resolve the schema bug?
It does. Do you want me to mysqldump out a schema from anywhere to help you verify that?
(In reply to comment #14)
> It does. Do you want me to mysqldump out a schema from anywhere to help you
> verify that?

Sure, if you could. Attach it to the schema-fixing bug?
Product: mozilla.org → mozilla.org Graveyard