Closed Bug 587396 Opened 15 years ago Closed 14 years ago

investigate inconsistent session expiration between master and slaves in sumo cluster

Categories: mozilla.org Graveyard :: Server Operations
Type: task
Hardware: All
OS: Other
Priority: Not set
Severity: major

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: justdave, Assigned: justdave

References

Details

I truncated the sessions table in the support_mozilla_com database tonight on the master and all slaves, manually, because the performance was horrible trying to insert new sessions into an 83 GB table. Something's not working right with expiring old sessions. The master database had 411 MB in the sessions table. Slave01 had 83 GB (yes, that's a G) and slave02 had 36 GB. So obviously whatever query is being used to expire old sessions isn't doing it consistently between the master and the slaves.
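For reference, a quick way to compare the on-disk footprint of the table across the master and slaves is a query like the one below against information_schema. This is an illustrative sketch (assuming MySQL 5.0+ where information_schema.TABLES is available), not the exact command used:

-- Illustrative sketch: run on the master and each slave and compare.
-- data_free shows space allocated to the table but not holding live rows.
SELECT table_name,
       table_rows,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb,
       ROUND(data_free / 1024 / 1024) AS free_mb
  FROM information_schema.tables
 WHERE table_schema = 'support_mozilla_com'
   AND table_name = 'sessions';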
Assignee: server-ops → tellis
Group: infra
I just truncated this again. 58 GB on slave01, 50 GB on slave02, and 411 MB on the master. The slaves were getting laggy, so I had to clear the sessions table to get them to catch up.
Severity: minor → major
Assignee: tellis → justdave
The query it's running to remove stale sessions from the database is:

DELETE FROM sessions WHERE expiry < UNIX_TIMESTAMP() LIMIT 300

That always seems to be the query that's backed up on the slaves when they get behind like this, too.
Tim said he found the script or cron running this query. AFAICT it's not part of the SUMO code. Also worth noting that the sessions table has "MAX_ROWS 20000" set, at least when I import production dumps.
(In reply to comment #3)
> Also worth noting that the sessions table has "MAX_ROWS 20000" set, at least
> when I import production dumps.

That's a legacy option, and I don't think MySQL actually uses it for that anymore. That number times AVG_ROW_LENGTH is used to estimate the maximum size of the table itself, but it doesn't actually cap the number of rows anymore.
Dave is correct. The MAX_ROWS directive does not do what its name suggests. You can (and we often do) have far more than 20k rows in the table.
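If you want to check what the option actually does on an imported dump, something like the following works (illustrative only, not a command from this bug):

-- Illustrative only: MAX_ROWS appears in the table DDL but does not cap the
-- row count; for MyISAM it is just a data-file sizing hint, and InnoDB
-- ignores it entirely.
SHOW CREATE TABLE sessions\G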
So we partially resolved this last week by removing some non-replication-safe SQL from a cron job that was trying to clean the sessions table. Based on the bug I just duped, it looks like we still need to free up some InnoDB storage that it left allocated. I reloaded the slaves last week with that intent, but didn't make sure they were cleaned up on the backup server first, and apparently they weren't. So we need to reload the DB from a dump file on the backup server and then push it back out to the slaves one at a time.
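For the record, the replication-unsafe part was presumably the DELETE with a bare LIMIT from comment 2: with no ORDER BY, statement-based replication lets the master and each slave pick different rows to delete. A replication-safe variant would look roughly like the sketch below; "session_id" is a hypothetical primary-key column, and this is not the exact change that was deployed:

-- Hedged sketch of a replication-safe cleanup (illustrative; "session_id" is
-- a hypothetical key column). A deterministic ORDER BY makes the LIMITed
-- DELETE remove the same rows on the master and on every slave.
DELETE FROM sessions
 WHERE expiry < UNIX_TIMESTAMP()
 ORDER BY session_id
 LIMIT 300;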
Whiteboard: [needs automation]
OK, the real story here, after lots of further diagnosis: the slaves are not doing garbage collection on the sessions table for some reason, even though the master apparently is. The row mismatch wasn't in the number of live rows in the table; it was in the number of rows allocated in the data file, most of which were marked deleted.

I now have staggered cron jobs in place on all of the slaves which do our garbage collection ("OPTIMIZE TABLE sessions") once an hour. Done hourly, this takes between 2 and 4 minutes and will cause replication to lag while it runs, because it locks the table. If left until we have problems, it takes about half an hour to run.

The above is a workaround. Further troubleshooting this morning with rsoderberg's help turned up a connection from the support_ro user which is holding a transaction open for 20 to 30 minutes at a time, sitting there sleeping most of that time. The open transaction prevents the transaction history purge thread from completing. We plan to split the traffic between the PHP and Python halves of the app, either to different user IDs or to different load-balancing vIPs, during tonight's outage window to help isolate it.
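The open transaction can be spotted with something like the query below; an illustrative sketch assuming MySQL 5.1 with the InnoDB plugin (or newer), where information_schema.innodb_trx exists. On older servers the same information shows up in the TRANSACTIONS section of SHOW ENGINE INNODB STATUS:

-- Illustrative: list open transactions oldest-first, with the connection
-- holding each one. A transaction that stays at the top for 20-30 minutes
-- is what keeps the purge thread from completing.
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id
  FROM information_schema.innodb_trx
 ORDER BY trx_started;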
Whiteboard: [needs automation] → [blocked]
(In reply to comment #8)
> We plan to split the traffic between the php and python halves of the app
> either to different user IDs or different load balancing vIPs during
> tonight's outage window to help isolate it.

We opted for different user IDs; that was pushed with bug 607964. Verdict: it's the PHP half that's holding the open transactions. That half of the app is going away in two weeks anyway, so there's no further action to take here. The problem will solve itself when the PHP portion of the app is decommissioned.
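Once the two halves connect as different MySQL users, the offender can be attributed by joining the open-transaction list to the processlist. Again an illustrative sketch, assuming information_schema.innodb_trx and information_schema.processlist are available on this server version:

-- Illustrative: which user owns the oldest open transaction?
SELECT p.user, p.command, p.time AS seconds_in_state, t.trx_started
  FROM information_schema.innodb_trx t
  JOIN information_schema.processlist p ON p.id = t.trx_mysql_thread_id
 ORDER BY t.trx_started
 LIMIT 1;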
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Whiteboard: [blocked]
Product: mozilla.org → mozilla.org Graveyard