Closed Bug 1223226 Opened 9 years ago Closed 8 years ago

Investigate improvements to purge_ttl backend sync daemon.

Categories

(Cloud Services Graveyard :: Server: Sync, defect, P3)


Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: bobm, Assigned: bobm)

References

Details

Sync databases tend to get overwhelmed, once they have accrued a certain number of records, by the TTL expiration job run from the purge_ttl.py daemon script.  Investigate running this job in a friendlier manner.  Some options include:

* Use MariaDB query limits.  Note that not all limits are available in MariaDB 5.5.x.  See: https://mariadb.com/kb/en/mariadb/query-limits-and-timeouts/

* Make use of occasional ONLINE OPTIMIZE TABLE jobs to clean up the table indexes before running TTL expiration queries.

* Limit purge TTL SQL jobs to lower traffic periods.

* Make use of purge TTL stats to improve queries.  See: bug 976387
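For the query-limits option above, a rough SQL sketch of what that could look like (max_statement_time requires MariaDB 10.1 or later, so it would not help on 5.5.x; the table name is illustrative, not taken from this bug):

```sql
-- Cap how long any single purge statement may run, in seconds
-- (MariaDB >= 10.1 only; not available in the 5.5.x series).
SET SESSION max_statement_time = 30;

-- Delete expired rows in bounded batches rather than one big statement.
DELETE FROM bso WHERE ttl < UNIX_TIMESTAMP() LIMIT 1000;
```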
(In reply to Bob Micheletto [:bobm] from comment #0)
 
> * Make use of occasional ONLINE OPTIMIZE TABLE jobs to clean up the table
> indexes before running TTL expiration queries.

See: https://www.percona.com/blog/2015/02/11/tokudb-table-optimization-improvements/
We could also try mucking around with the transactions and isolation level to avoid the purge script taking too many locks.

I noticed that the purge script calls the backend method purge_expired_items, which (simplified) looks like this:

    @with_session
    def purge_expired_items(self, session, grace_period=0, max_per_loop=1000):
        num_iters = 0
        while num_iters < 100:
            session.query("PURGE_SOME_EXPIRED_ITEMS", { "maxitems": max_per_loop })
            num_iters += 1

This tries to delete items in small batches of 1000 so as not to tie up db resources.  But because it's inside a @with_session context, I suspect that it may actually run all of its iterations inside a single long-running transaction.

Refactoring it to create a new transaction for each batch might mean it takes longer to run but is kinder to concurrent queries, something like:

    def purge_expired_items(self, grace_period=0, max_per_loop=1000):
        num_iters = 0
        while num_iters < 100:
            with self._get_or_create_session() as session:
                session.query("PURGE_SOME_EXPIRED_ITEMS", { "maxitems": max_per_loop })
            num_iters += 1

Adding an explicit `SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED` could further reduce contention; according to [1], it reduces the amount of index space locked by a DELETE statement.  Dirty reads shouldn't be a problem, since this script only purges old records.

[1] https://dev.mysql.com/doc/refman/5.6/en/set-transaction.html
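To make the per-batch-transaction idea concrete, here's a minimal, illustrative sketch in plain DB-API terms (sqlite3 stands in for the real MySQL backend so the sketch is self-contained; the `bso` table, `ttl` column, and function signature are assumptions for illustration, not the actual syncstorage code):

```python
import sqlite3
import time

def purge_expired_items(conn, grace_period=0, max_per_loop=1000, max_loops=100):
    """Delete expired rows in small, separately committed batches.

    Each batch runs in its own transaction, so locks are released
    between iterations rather than held for the whole purge.
    """
    cutoff = int(time.time()) - grace_period
    total = 0
    for _ in range(max_loops):
        with conn:  # one transaction per batch; commits on success
            # On MySQL/MariaDB one could first issue:
            #   SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
            # and use DELETE ... LIMIT directly; the subselect form
            # below is just the portable sqlite3 equivalent.
            cur = conn.execute(
                "DELETE FROM bso WHERE rowid IN "
                "(SELECT rowid FROM bso WHERE ttl < ? LIMIT ?)",
                (cutoff, max_per_loop),
            )
        total += cur.rowcount
        if cur.rowcount < max_per_loop:
            break  # nothing left to purge
    return total
```

The `break` on a short batch also avoids burning all 100 iterations when the table is already clean.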
Blocks: 1250189
Flags: firefox-backlog+
Whiteboard: [sync-data-integrity]
bobm to work with the team on a strategy to fix this.
Assignee: nobody → bobm
Priority: -- → P2
Priority: P2 → P3
BTW the stuff in Comment 2 has been addressed as part of the batch-uploads branch; we'll need to link to it from here when that merges to master.
I've hacked a suggested fix into stage that removes the ORDER BY clause from the PURGE_SOME_EXPIRED_ITEMS SQL query.  :kthiessen, stage is ready for testing.
Flags: needinfo?(kthiessen)
QA Contact: kthiessen
I will run a load test against Stage this evening or tomorrow morning PDT, to see if that fix holds up.
Flags: needinfo?(kthiessen)
Bob, are we waiting for the sharding fix in https://bugzilla.mozilla.org/show_bug.cgi?id=1273102#c18 to go in before we load test this, or should I just go ahead and crank up some load?
Flags: needinfo?(bobm)
(In reply to Karl Thiessen [:kthiessen] from comment #7)
> Bob, are we waiting for the sharding fix in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1273102#c18 to go in before we
> load test this, or should I just go ahead and crank up some load?

I think we can test this independently.  If it doesn't spike immediately on load, I think we can verify this fix.
Flags: needinfo?(bobm)
Starting 90-minute load test now.
(In reply to Karl Thiessen [:kthiessen] from comment #9)
> Starting 90-minute load test now.

Load test results look good to me.  There was no read-IOPS spike.
OK.  Taking several steps to mark this VERIFIED FIXED.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard