Closed Bug 1626707 Opened 4 years ago Closed 1 year ago

Rust Purge TTL script is failing while trying to delete BSO data

Categories

(Cloud Services Graveyard :: Server: Sync, defect, P1)

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: eolson, Assigned: jrconlin)

Attachments

(1 file)

Logs show that it deletes the batches data, then aborts during the bso purge at approximately the 10-minute mark, much like the Python version did.

[2020-04-01T04:00:01Z INFO  purge_ttl] For projects/moz-fx-sync-prod-3f0c/instances/sync/databases/syncdb4
[2020-04-01T04:03:00Z INFO  purge_ttl] batches: removed 130750 rows
[2020-04-01T04:03:00Z INFO  purge_ttl] ⌚ "syncstorage.purge_ttl.batches_duration:178386|ms"
[2020-04-01T04:13:05Z INFO  purge_ttl] ⌚ "syncstorage.purge_ttl.bso_duration:605017|ms"
[2020-04-01T04:13:05Z INFO  purge_ttl] ⌚ "syncstorage.purge_ttl.total_duration:783403|ms"
Error: RpcFailure(RpcStatus { status: RpcStatusCode(10), details: Some("Aborted due to transient fault") })

The Python version's error is different:

{"datetime": "2020-03-20 04:22:57,559", "message": "Starting purge_ttl.py"}
{"datetime": "2020-03-20 04:22:57,559", "message": "For sync:syncdb4"}
{"datetime": "2020-03-20 04:24:02,090", "message": "batches: removed 587 rows, batches_duration: 0:01:04.530411"}
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/google/api_core/grpc_helpers.py", line 79, in next
    return six.next(self._wrapped)
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1584681841.118465835","description":"Error received from peer ipv4:74.125.20.95:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/tools/spanner/purge_ttl.py", line 76, in <module>
    spanner_read_data()
  File "/app/tools/spanner/purge_ttl.py", line 65, in spanner_read_data
    result = database.execute_partitioned_dml(query)
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/spanner_v1/database.py", line 395, in execute_partitioned_dml
    list(result_set)  # consume all partials
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/spanner_v1/streamed.py", line 143, in __iter__
    self._consume_next()
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/spanner_v1/streamed.py", line 116, in _consume_next
    response = six.next(self._response_iterator)
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/spanner_v1/snapshot.py", line 45, in _restart_on_unavailable
    for item in iterator:
  File "/usr/local/lib/python3.7/dist-packages/google/api_core/grpc_helpers.py", line 81, in next
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded

According to https://github.com/grpc/grpc/blob/master/doc/statuscodes.md, it's possible the call completed successfully despite the DEADLINE_EXCEEDED error. There's also a distinction between timeouts and deadlines: https://grpc.io/blog/deadlines/ -- I know the Python client configures a timeout, but I wonder if it ends up with an inappropriate deadline for this call.
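
To make that concrete: from the traceback, the purge boils down to one execute_partitioned_dml call per table, and the gRPC deadline applies to that single long-running call. A rough sketch of the Python side (a sketch only; the DELETE text is assumed, modeled on the expiry check used in the row-count query later in this bug):

from google.cloud import spanner

client = spanner.Client()
database = client.instance("sync").database("syncdb4")

# Partitioned DML is planned and executed server-side in partitions, so a
# single call can legitimately run for many minutes on a large table. If the
# per-call deadline is shorter than that, the client raises DEADLINE_EXCEEDED
# even though Spanner may keep running (or have finished) the work.
row_count = database.execute_partitioned_dml(
    "DELETE FROM bsos WHERE expiry < CURRENT_TIMESTAMP()"
)
print(f"bsos: removed {row_count} rows")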

As for the Rust version, gRPC clients tend to have retry logic -- and ABORTED signals that the call failed "typically due to a concurrency issue", indicating the client should retry some number of times. I'm not sure our low-level Rust library (essentially a thin wrapper over the C++ grpcio internals) does any retrying for us.
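
On the Python side, at least, an ABORTED response can be retried with the helper in google.api_core; a minimal sketch (the backoff values here are my own guesses, not what the client library ships with):

from google.api_core import exceptions
from google.api_core.retry import Retry, if_exception_type

retry_aborted = Retry(
    predicate=if_exception_type(exceptions.Aborted),
    initial=1.0,      # first backoff, in seconds
    maximum=32.0,     # cap on the backoff
    multiplier=2.0,   # exponential growth factor
    deadline=3600.0,  # give up after an hour of retrying
)

def purge(database, query):
    # Re-issues the partitioned DML whenever it fails with 409 Aborted.
    return retry_aborted(database.execute_partitioned_dml)(query)

Whether the Rust/grpcio stack does anything equivalent is exactly the open question.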

I suspect the Rust version is not completing the deletions. The job controller runs one retry of the entire pod (which runs the script again) when the first run ends in failure; that retry also fails each time.

Priority: -- → P1
Assignee: nobody → dpreston

GCP support said they recently fixed a similar bug in the Java client and added the fix to the Python client in v1.15.1:
https://github.com/googleapis/python-spanner/pull/49

It is a gRPC keepalive setting. I built an image with the new Python client and tested it in stage and production. In stage it worked normally; in prod it now fails at the 10-minute mark with a new error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 96, in next
    return six.next(self._wrapped)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.ABORTED
	details = "Aborted due to transient fault"
	debug_error_string = "{"created":"@1586958994.625062056","description":"Error received from peer ipv4:74.125.197.95:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Aborted due to transient fault","grpc_status":10}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/purge_ttl.py", line 78, in <module>
    spanner_read_data()
  File "/app/purge_ttl.py", line 67, in spanner_read_data
    result = database.execute_partitioned_dml(query)
  File "/usr/local/lib/python3.7/site-packages/google/cloud/spanner_v1/database.py", line 463, in execute_partitioned_dml
    list(result_set)  # consume all partials
  File "/usr/local/lib/python3.7/site-packages/google/cloud/spanner_v1/streamed.py", line 143, in __iter__
    self._consume_next()
  File "/usr/local/lib/python3.7/site-packages/google/cloud/spanner_v1/streamed.py", line 116, in _consume_next
    response = six.next(self._response_iterator)
  File "/usr/local/lib/python3.7/site-packages/google/cloud/spanner_v1/snapshot.py", line 46, in _restart_on_unavailable
    for item in iterator:
  File "/usr/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 99, in next
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.Aborted: 409 Aborted due to transient fault
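
For reference, the keepalive fix boils down to passing gRPC channel options when the Spanner channel is created; something along these lines (the option values here are illustrative guesses, not the exact ones from the PR):

import grpc

keepalive_options = [
    ("grpc.keepalive_time_ms", 30000),           # send a keepalive ping every 30s
    ("grpc.keepalive_timeout_ms", 10000),        # wait up to 10s for the ping ack
    ("grpc.keepalive_permit_without_calls", 1),  # allow pings with no active calls
]
channel = grpc.secure_channel(
    "spanner.googleapis.com:443",
    grpc.ssl_channel_credentials(),
    options=keepalive_options,
)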

Testing locally with a new Rust version created by fzzzy.

The new version does a SELECT with a LIMIT to find purgeable records, then does a batch delete of all records found by the select.
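
The same pattern, sketched in Python against the Spanner client just to show its shape (the real script is Rust, and the bsos key columns below are a guess at the schema):

from google.cloud import spanner

client = spanner.Client()
database = client.instance("sync").database("syncdb4")

# Step 1: read a bounded chunk of expired primary keys.
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT fxa_uid, fxa_kid, collection_id, bso_id FROM bsos "
        "WHERE expiry < CURRENT_TIMESTAMP() LIMIT @chunk",
        params={"chunk": 1000},
        param_types={"chunk": spanner.param_types.INT64},
    )
    keys = [list(row) for row in results]

# Step 2: delete exactly those rows with a single batch mutation, which stays
# bounded in size because the chunk is bounded.
if keys:
    with database.batch() as batch:
        batch.delete("bsos", spanner.KeySet(keys=keys))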

The new Rust purge script was tested in stage. Waiting for validation/testing before running it in production.

Flags: needinfo?(rtublitz)
Flags: needinfo?(rtublitz)

Whoops - didn't mean to remove the ni? from myself. I'll keep it open until this is rolling. Currently working with rbillings and fzzzy to coordinate details; doc lives here.

Flags: needinfo?(rtublitz)

Rebecca, I know you were working on getting testing rolling here: did we get that done last week, or is this still in progress?

Flags: needinfo?(rtublitz) → needinfo?(rbillings)

Testing was completed 4/23 and was successful.

Flags: needinfo?(rbillings)
Assignee: dpreston → jrconlin
Status: NEW → ASSIGNED

The new Python Spanner module has implemented retry of partitioned DML queries. It finally seems to be working the way it's supposed to (or the way it was described as working a few months ago).

The cron has been running overnight since Thursday/Friday night (5/15). The purge on the bsos table errors out at 3600s. The Kubernetes job is configured to retry the job 2 more times. Because the jobs have so far always ended in error, we can't tell how many rows are purged on each run. We have been occasionally running a very expensive query to count all the expired rows, and the count is coming down:

5/8 - 117 million
5/16 - 20 million
5/17 - 9 million

Once a row is purged, it can take up to 48 hours for the data on disk to be removed from the cluster. We can infer that a few hundred gigabytes of data have been removed, as the total bytes metric has remained flat for several days (image attached showing 30 days of Spanner disk usage).

I expected it to reach zero overnight last night, but it did not. Presumably once it reaches zero, a job will complete successfully and log the number of bsos rows it purged.

I ran the row count query this evening: select count(*) from bsos where expiry < CURRENT_TIMESTAMP() and got:
46,064,326

Now I am quite confused!

This morning it is down to 9,586,779 again.

The Rust version is not used; the Python version is. Closing as INVALID.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → INVALID
Product: Cloud Services → Cloud Services Graveyard