Closed Bug 1388963 Opened 7 years ago Closed 7 years ago

Increased 503s on some servers since 1.6.8 rollout

Categories

(Cloud Services Graveyard :: Server: Sync, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bobm, Assigned: rfkelly)

References

Details

Attachments

(1 file, 2 obsolete files)

Select servers with the Sync Batch API enabled and running the 1.6.8 server version have an increased 503 rate.  These errors are caused by MySQL 1205 Lock Timeouts.  On one server all requests were POSTs and spread across all collections.  Interestingly, all errors were for users sharded to one of two batch uploads tables (0 and 10).
> Interestingly, all errors were for users sharded to one of two batch uploads tables (0 and 10).

This sounds familiar, I recall a previous bug where we were *only* sending users to those two tables.  Can you confirm whether users are being correctly sharded across all the batch upload tables in general?

> Select servers with the Sync Batch API enabled and running the 1.6.8 server version have an increased 503 rate.

To be clear, some batch-enabled servers are showing errors, but other batch-enabled servers are not showing anything unusual?
(In reply to Ryan Kelly [:rfkelly] from comment #1)

> This sounds familiar, I recall a previous bug where we were *only* sending
> users to those two tables.  Can you confirm whether users are being
> correctly sharded across all the batch upload tables in general?

The other tables are getting data, and users sharded to them are successful in uploading batches.  An interesting thing to note: according to the table status summary, the 0 and 10 tables have roughly twice as many rows as the other tables.  That's just an approximation, however; a SELECT COUNT showed that table 0 had around 10x as many rows, and I didn't wait for table 10's query to finish running.

The fix might be to turn off batch uploads on this server and truncate (or drop and recreate) all the batch uploads tables before re-enabling.
 
> To be clear, some batch-enabled servers are showing errors, but other
> batch-enabled servers are not showing anything unusual?

Correct.
On the server that's showing the unusual skew to some batch tables, can we check if there's a similar unusual skew in the last digit of the userid?  In other words, does this box somehow have a lot more users whose userid ends in "0" than ends in some other number?

> The fix might be to turn off batch uploads on this server and truncate (or drop and recreate)
> all the batch uploads tables before re-enabling.

+1; it seems reasonable to assume that the load problems are due to this skew, and we could see whether it persists after clearing out the old data.
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> On the server that's showing the unusual skew to some batch tables, can we
> check if there's a similar unusual skew in the last digit of the userid?  In
> other words, does this box somehow have a lot more users whose userid ends
> in "0" than ends in some other number?

The user ids are evenly distributed among the 20 tables. 

> 
> +1; it seems reasonable to assume that the load problems are due to this
> skew, and we could see whether it persists after clearing out the old data.

Clearing the data did the trick.  Will make that part of the re-enabling procedure.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reopening this bug as the error came up on another node this week, this time distributed across the batch_upload_items2 and batch_upload_items12 tables.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
One theory is that this node has an unusually large number of uploads from users whose uid ends in 2.  Bobm, I wonder if we can check this by doing a quick analysis of the nginx logs, counting the number of POST requests in e.g. a 24 hour period, grouped by the final digit of the userid.  If this shows a spike in POST count for uids with "2" compared to the general population, then that'll be a clue as to the cause here.

(I'm happy to do the crunching here if you don't have the bandwidth for it, and can provide an appropriate log snapshot)
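
For the record, the crunch I have in mind is roughly the following (a rough sketch only, assuming a combined-format nginx access log with userids in the `/1.5/<uid>/` path segment; the log filename and regex here are illustrative, not our actual config):

```python
# Count POST requests per final digit of the userid in an nginx access log.
import re
from collections import Counter

POST_RE = re.compile(r'"POST /1\.5/(\d+)/storage/')

counts = Counter()
with open("access.log") as logfile:
    for line in logfile:
        match = POST_RE.search(line)
        if match:
            userid = match.group(1)
            counts[userid[-1]] += 1  # group by last digit of the userid

for digit in sorted(counts):
    print("%s %d" % (digit, counts[digit]))
```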
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #6)
> One theory is that this node has an unusually large number of uploads from
> users whose uid ends in 2.  Bobm, I wonder if we can check this by
> doing a quick analysis of the nginx logs, counting the number of POST
> requests in e.g. a 24 hour period, grouped by the final digit of the userid.
> If this shows a spike in POST count for uids with "2" compared to the
> general population, then that'll be a clue as to the cause here.
> 
> (I'm happy to do the crunching here if you don't have the bandwidth for it,
> and can provide an appropriate log snapshot)

Several of the servers have two hot tables that are separated by 10: 5 and 15, 2 and 12, 7 and 17, etc.  However, sync userids and batch uploads are evenly spread across all twenty tables.  It is conceivable that, if servers are built in large batches (groups of ten, say), Tokenserver would hand out userids in order and cause hot spots like this.  I see no evidence of this in practice.

As discussed in the weekly stand-up, I will enable the Batch API on two groups of servers: one where I predict which hot spots are likely based on the population of the idle batch_uploads* tables, and another where I truncate the batch_uploads* tables before enabling the feature.  We will let them run in this fashion and observe whether the expected behavior results.

The expected behavior: 
Group 1 will exhibit elevated 503 response codes from MySQL 1205 lock timeouts on the predicted tables.
Group 2 will exhibit negligible 503 counts.

I propose we follow the suggestion from the meeting of putting ten servers in each group.  I plan to resolve any confirmed predictions in the Group 1 set by turning off the Batch API and truncating the offending tables.
Flags: needinfo?(bobm)
:thumbsup:

> two hot tables that are separated by 10: 5 and 15, 2 and 12, 7 and 17

This separation by 10 is a symptom of the way we do the sharding.  The last digit of the batchid is always the last digit of the userid, but the other digits come from a timestamp.  So if you have a particularly heavy user with a uid ending in 7, all of their writes will go into either batch table 7 or batch table 17, causing the pattern here.
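
In pseudo-Python, the scheme is roughly the following (a simplified sketch to illustrate the pattern; the names and the exact shard count are illustrative, not the real server-syncstorage code):

```python
import time

NUM_SHARDS = 20  # batch_upload_items0 .. batch_upload_items19 (assumed)

def make_batchid(userid):
    # Millisecond timestamp with its last digit replaced by the
    # last digit of the userid.
    ts_ms = int(time.time() * 1000)
    return ts_ms - (ts_ms % 10) + (userid % 10)

def batch_items_table(batchid):
    # Items are assumed to be sharded by batchid modulo the table count.
    return "batch_upload_items%d" % (batchid % NUM_SHARDS)

# A heavy user whose uid ends in 7 always lands in table 7 or table 17,
# depending on the tens digit of the timestamp when the batch was created.
```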
We found an issue with the way the purge job handles batch items, which may explain the problems here.  The purge query we're currently using is (the DELETE-equivalent of) this:

  MariaDB [weave0]> EXPLAIN SELECT * FROM batch_upload_items0 WHERE batch < (UNIX_TIMESTAMP() - 60000) * 1000
                 -> ORDER BY batch LIMIT 100000;
  +------+-------------+-------------+------+---------------+------+---------+------+------+-----------------------------+
  | id   | select_type | table       | type | possible_keys | key  | key_len | ref  | rows | Extra                       |
  +------+-------------+-------------+------+---------------+------+---------+------+------+-----------------------------+
  |    1 | SIMPLE      | bat[...]ms0 | ALL  | PRIMARY       | NULL | NULL    | NULL |    1 | Using where; Using filesort |
  +------+-------------+-------------+------+---------------+------+---------+------+------+-----------------------------+
  1 row in set (0.00 sec)

That "Using where; Using filesort" means that each time this job runs, it does a full table scan of potentially billions of rows, dumps them all to disk in sorted order, then picks from that list the ones it wants to delete.  I can imagine this causing some unexpected 503s for tables with many rows!

This query *should* be able to use the primary key index on "batch" to execute without a filesort, but the use of UNIX_TIMESTAMP() in the WHERE clause seems to be preventing it.  If we replace that with a constant, we get more sensible behaviour:

  MariaDB [weave0]> EXPLAIN SELECT * FROM batch_upload_items0 WHERE batch < (333333333 - 1000) * 1000 ORDER BY batch LIMIT 100000;
  +------+-------------+---------------------+-------+---------------+---------+---------+------+------+-------------+
  | id   | select_type | table               | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
  +------+-------------+---------------------+-------+---------------+---------+---------+------+------+-------------+
  |    1 | SIMPLE      | batch_upload_items0 | range | PRIMARY       | PRIMARY | 8       | NULL |    1 | Using where |
  +------+-------------+---------------------+-------+---------------+---------+---------+------+------+-------------+
  1 row in set (0.00 sec)


On an interesting side note, the use of UNIX_TIMESTAMP() doesn't seem to prevent using a non-primary-key index for similar queries, such as this one for deleting items with expired ttls:

  MariaDB [weave0]> EXPLAIN SELECT * FROM bso0 WHERE ttl < (UNIX_TIMESTAMP() - 1000) ORDER BY ttl LIMIT 10000;
  +------+-------------+-------+-------+---------------+-------------+---------+------+------+-------------+
  | id   | select_type | table | type  | possible_keys | key         | key_len | ref  | rows | Extra       |
  +------+-------------+-------+-------+---------------+-------------+---------+------+------+-------------+
  |    1 | SIMPLE      | bso0  | range | bso_ttl_idx   | bso_ttl_idx | 4       | NULL |    5 | Using where |
  +------+-------------+-------+-------+---------------+-------------+---------+------+------+-------------+
  1 row in set (0.00 sec)

So I wonder if this is some sort of bug in MySQL's query planner...

Anyway, there's a clear workaround so I'm going to make a quick patch for it and we'll see if that resolves the 503 issue.
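
The workaround amounts to computing the cutoff on the client side and putting it into the query as a constant, so the planner can do a range scan on the primary key.  In rough Python it looks like this (a sketch only, not the actual patch; the grace period and the `connection.execute` interface are illustrative):

```python
import time

BATCH_LIFETIME = 60000  # seconds before a batch is considered stale (illustrative)

def purge_old_batch_items(connection, shard):
    # Computing the cutoff here, rather than calling UNIX_TIMESTAMP() in the
    # WHERE clause, lets MySQL use the primary key index instead of a filesort.
    cutoff = int(time.time() - BATCH_LIFETIME) * 1000
    connection.execute(
        "DELETE FROM batch_upload_items%d "
        "WHERE batch < %d "
        "ORDER BY batch LIMIT 100000" % (shard, cutoff)
    )
```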
Created attachment 8900985 [details] [review]
PR to avoid use of UNIX_TIMESTAMP() in db queries

Bob could you please sanity-check this fairly trivial patch for me?
Attachment #8900985 - Flags: review?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #10)
> Created attachment 8900985 [details] [review]
> PR to avoid use of UNIX_TIMESTAMP() in db queries
> 
> Bob could you please sanity-check this fairly trivial patch for me?

r+'d in git, with the caveat that I'm not a Python programmer.  From a sanity-check perspective it looks good to me, and it's a bonus to have those queries be the same in Postgres and SQLite.
Attachment #8900985 - Flags: review?(bobm)
Assignee: nobody → rfkelly
Depends on: 1394253
Component: Firefox Sync: Backend → Server: Sync
I have been getting

1506947925722	Sync.Collection	WARN	POST request to https://sync-444-us-west-2.sync.services.mozilla.com/1.5/62856219/storage/history?batch=true failed with status 503
1506947925722	Sync.Engine.History	DEBUG	Uploading records failed: <html>
<head><title>503 Service Unavailable</title></head>
<body bgcolor="white">
<center><h1>503 Service Cloudy: Try again later. </h1></center>
</body>
</html>

regularly in my sync logs for the last couple of days, but https://status.services.mozilla.com/ says the service is up and indeed some syncs do seem to work.  Is this the reason?
(In reply to Rui Matos from comment #12)

> regularly in my sync logs for the last couple of days, but
> https://status.services.mozilla.com/ says the service is up and indeed some
> syncs do seem to work. Is this the reason?

There was a long-running query of this nature on the server; it has been killed, which should clear up your present issue.  We do need to do additional tuning of this replacement query.

EXPLAIN SELECT * FROM batch_upload_itemsX WHERE batch < (1506709167 - 7200 - 86400) * 1000 ORDER BY batch LIMIT 1000;
+------+-------------+---------------------+-------+---------------+---------+---------+------+------+-------------+
| id   | select_type | table               | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+------+-------------+---------------------+-------+---------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | batch_upload_itemsX | index | PRIMARY       | PRIMARY | 78      | NULL | 1000 | Using where |
+------+-------------+---------------------+-------+---------------+---------+---------+------+------+-------------+

EXPLAIN SELECT * FROM batch_upload_itemsX WHERE batch < (1506709167 - 7200 - 86400) ORDER BY batch LIMIT 1000;
+------+-------------+---------------------+-------+---------------+---------+---------+------+-------+-------------+
| id   | select_type | table               | type  | possible_keys | key     | key_len | ref  | rows  | Extra       |
+------+-------------+---------------------+-------+---------------+---------+---------+------+-------+-------------+
|    1 | SIMPLE      | batch_upload_itemsX | range | PRIMARY       | PRIMARY | 8       | NULL | 19377 | Using where |
+------+-------------+---------------------+-------+---------------+---------+---------+------+-------+-------------+


:rfkelly any ideas?
Flags: needinfo?(rfkelly)
Also:

EXPLAIN SELECT * FROM batch_upload_items3 WHERE batch = XXXXXXXXXXXXXXX AND userid = XXXXXXXX;
+------+-------------+---------------------+------+---------------+---------+---------+-------------+------+-------+
| id   | select_type | table               | type | possible_keys | key     | key_len | ref         | rows | Extra |
+------+-------------+---------------------+------+---------------+---------+---------+-------------+------+-------+
|    1 | SIMPLE      | batch_upload_itemsX | ref  | PRIMARY       | PRIMARY | 12      | const,const |    1 |       |
+------+-------------+---------------------+------+---------------+---------+---------+-------------+------+-------+
To add some more context, when :bob killed the above long-running query, it said:

```
  COMMAND: Query
     TIME: 123620
    STATE: Queried about 67XXXXXXXXX rows
     INFO: /* [queryName=CLOSE_BATCH_ITEMS] */
    DELETE FROM batch_upload_items3
    WHERE batch = 1506XXXXXXXXX AND userid = XXXXXXXX
```

That is, it claimed to have examined some 67 billion rows in order to do this simple primary-key-based deletion, which is crazypants.  Bob, that `EXPLAIN` output is definitely from the same query run against the same table on the same host, right?
If we catch another one in the act, I wonder if it's possible to use `EXPLAIN FOR CONNECTION` to get the active query plan for the actual query that's running:

  https://dev.mysql.com/doc/refman/5.7/en/explain-for-connection.html

It may not be possible if this version of MySQL does not support `EXPLAIN` for `DELETE` queries :-(
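
If it is supported, the check could be as simple as this hypothetical helper (`EXPLAIN FOR CONNECTION` needs MySQL 5.7+; on MariaDB the rough equivalent is `SHOW EXPLAIN FOR <thread_id>`):

```python
def explain_long_running_queries(cursor, min_seconds=60):
    # Find connections that have been running a query for a while...
    cursor.execute(
        "SELECT id, time, info FROM information_schema.processlist "
        "WHERE command = 'Query' AND time > %d" % min_seconds
    )
    for conn_id, secs, statement in cursor.fetchall():
        print("connection %d, running for %ds: %s" % (conn_id, secs, statement))
        # ...and ask the server for the plan it is actually executing.
        cursor.execute("EXPLAIN FOR CONNECTION %d" % conn_id)
        for row in cursor.fetchall():
            print(row)
```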
Having killed the long-running query, do we see it re-occurring, or was it a one-off thing?  In particular the long-running `CLOSE_BATCH_ITEMS` might re-occur because that happens during the normal course of a sync, not as part of a background job.

(Or, did you purge the batch tables in order to unstick this box?)
Flags: needinfo?(rfkelly) → needinfo?(bobm)
I think we need better visibility into the performance of these queries before they get to "catastrophic full table scan".  Bob, what do you think about adding timers as in the linked PR?

I believe the existing metrics_timer stuff goes somewhere in sync operational logs, but do we capture logs from the background purge process?  What format should I log in to ensure we can get the timings out of there and into a graph somewhere?
Attachment #8914585 - Flags: feedback?(bobm)
Hmm, assuming that the batch item tables are using tokudb, this article seems super interesting and relevant:

  https://www.percona.com/blog/2015/02/11/tokudb-table-optimization-improvements/

It opens with "For tables whose workload pattern is a high number of sequential deletes, some operational maintenance is required to ensure consistently fast performance", and that certainly describes how we're using the batch item tables - inserting and then deleting items with sequentially increasing batch ids at a pretty steady clip.

We might need to consider running periodic background optimization on these tables :-/
From vidyo...

> Having killed the long-running query, do we see it re-occurring, or was it a one-off thing?
> (Or, did you purge the batch tables in order to unstick this box?)

It recurs.  Truncating the batch item tables makes the problem go away.

> Hmm, assuming that the batch item tables are using tokudb, this article seems super interesting and relevant

They are indeed using tokudb.

Bob is going to find one of the problematic tables and try doing an `OPTIMIZE TABLE` rather than a `TRUNCATE`.  If that makes the problem go away, it's a good clue that we're looking at index degradation of the sort described in the above article.

> We might need to consider running periodic background optimization on these tables :-/

Our version of MySQL is too old to support online table optimization.  We can try to update to a newer version, or we can try to use `tokudb_optimize_index_fraction` and related options to build an incremental optimization procedure that doesn't take long-lived table locks.
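
If we go the incremental route, I'm imagining something like the following (a sketch only; it assumes the TokuDB session variable from the Percona article is available in our server version, and the fraction value is illustrative, not tuned):

```python
def optimize_batch_tables_incrementally(cursor, num_shards=20):
    # Only optimize a fraction of each index per pass, so we never hold a
    # long-lived table lock.
    cursor.execute("SET SESSION tokudb_optimize_index_fraction = 0.1")
    for shard in range(num_shards):
        cursor.execute("OPTIMIZE TABLE batch_upload_items%d" % shard)
```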

Either way, first step here is to see whether optimizing these tables fixes the bad query performance, so leaving the ni? for :bobm to report on his experiments there.
We could also just try making the batch item tables regular old InnoDB tables rather than TokuDB; their pattern of write access is different enough from that of the regular bso tables that we might get good performance out of InnoDB.
(In reply to Ryan Kelly [:rfkelly] from comment #20)

> Bob is going to find one of the problematic tables and try doing an
> `OPTIMIZE TABLE` rather than a `TRUNCATE`.  If that makes the problem go
> away, it's a good clue that we're looking at index degradation of the sort
> described in the above article.

Earlier this morning I briefly disabled batch uploads on a server facing this problem and ran an OPTIMIZE TABLE on all of the batch uploads tables.  After re-enabling batch uploads the problem did not resurface on this server.

> Our version of MySQL is too old to support online table optimization.  We
> can try to update to a newer version, or we can try to use
> `tokudb_optimize_index_fraction` and related options to build an incremental
> optimization procedure that doesn't take long-lived table locks.

The longest OPTIMIZE TABLE operation took 4.5 seconds to complete.  That's short enough that we may want to just occasionally run full table optimizations instead of the incremental procedures.
Flags: needinfo?(bobm)
If they're small and fast enough, we could consider throwing in an explicit `OPTIMIZE TABLE` at the end of every purge run, just to keep things under control.
Also in theory, this means we could recover a server from this state by just running `OPTIMIZE TABLE`, without having to disable and re-enable the batch upload feature.  May be worth trying out next time as well :-)
Needs tests etc, but this is broadly what I'm thinking about adding:

  https://github.com/mozilla-services/server-syncstorage/pull/76
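
The new bit boils down to something like this (an illustrative sketch of the idea, not the actual diff; table names are per the discussion above, everything else is hypothetical):

```python
def optimize_after_prune(connection, num_shards=20):
    # Run after each prune pass: defragment the TokuDB batch tables so the
    # steady stream of sequential deletes doesn't degrade the indexes into
    # full-table-scan territory.
    for shard in range(num_shards):
        connection.execute("OPTIMIZE TABLE batch_uploads%d" % shard)
        connection.execute("OPTIMIZE TABLE batch_upload_items%d" % shard)
```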

Bob, thoughts?
(In reply to Ryan Kelly [:rfkelly] from comment #24)
> Also in theory, this means we could recover a server from this state by just
> running `OPTIMIZE TABLE`, without having to disable and re-enable the batch
> upload feature.  May be worth trying out next time as well :-)

If things are that quick, no better alternatives are found, and it's something that will need to be done on an ongoing basis, would it be an option to 5XX-inate the server for that period rather than disabling batches?  That should be silent from the POV of users and would avoid the data-integrity risks of disabling batching.
> would it be an option to 5XX-inate the server for that period rather than disabling batches

If the `OPTIMIZE TABLE` takes a lock and holds it for more than a few seconds, you'll get the 503-inating behaviour for free by virtue of the db queries starting to time out...

But from what I've read and seen here I'm hopeful it can actually run concurrently without noticeable impact to clients.
(In reply to Ryan Kelly [:rfkelly] from comment #25)
> Needs tests etc, but this is broadly what I'm thinking about adding:
> 
>   https://github.com/mozilla-services/server-syncstorage/pull/76
> 
> Bob, thoughts?

Given Comment 27 and my anecdotal OPTIMIZE TABLE durations, I think this is more than reasonable, and we may be able to drop the index fraction altogether.
Note that as written, this would optimize the bsoX tables as well as the batch_upload_itemsX tables.  It might help make the TTL-based BSO purging run faster too!  But we should probably experiment with it under load.
> Note that as written, this would optimize the bsoX tables as well as the batch_upload_itemsX tables.

:bobm, thoughts on whether we should also optimize the bsoX tables periodically, or just focus on the batch tables?
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #30)

> :bobm, thoughts on whether we should also optimize the bsoX tables
> periodically, or just focus on the batch tables?

It might improve overall speed, but I would want to test it pretty thoroughly.  We don't necessarily have a problem there at present.  Add a feature flag, and default to off?
Flags: needinfo?(bobm)
After a lot of wasted time figuring out why my tests were passing when they should have failed, I've finally got a PR ready for this.

Bob, based on your feedback, this adds an `OPTIMIZE TABLE` to the `batch_uploads` and `batch_upload_items` tables after every prune.  I removed the fancypants "optimize fraction" stuff since it seems to run quickly anyway.

Kit, I'm not sure if I've asked you this before, but how's your python?  If you have some familiarity it'd be great to get your r? on this since there's nothing too tricky going on, but it still needs a good sanity-check :-)
Attachment #8900985 - Attachment is obsolete: true
Attachment #8914585 - Attachment is obsolete: true
Attachment #8914585 - Flags: feedback?(bobm)
Attachment #8920990 - Flags: review?(kit)
Attachment #8920990 - Flags: feedback?(bobm)
Attachment #8920990 - Flags: review?(kit) → review+
Comment on attachment 8920990 [details] [review]
PR to automatically optimize batch tables after pruning

Sounds good to me.
Attachment #8920990 - Flags: feedback?(bobm) → feedback+
I've tagged v1.6.11 including this change, deploy in Bug 1411192.  I'm going to tentatively mark this as RESOLVED/FIXED, let's either verify it or re-open it once we know whether that deploy sticks and helps in production.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard
