Closed Bug 1626707 Opened 4 years ago Closed 1 year ago

Rust Purge TTL script is failing while trying to delete BSO data

Categories

(Cloud Services Graveyard :: Server: Sync, defect, P1)

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: eolson, Assigned: jrconlin)

Attachments

(1 file)

Logs show that it deletes the batches data, then aborts during the bso purge at approximately the 10-minute mark, much like the Python version did.

[2020-04-01T04:00:01Z INFO  purge_ttl] For projects/moz-fx-sync-prod-3f0c/instances/sync/databases/syncdb4
[2020-04-01T04:03:00Z INFO  purge_ttl] batches: removed 130750 rows
[2020-04-01T04:03:00Z INFO  purge_ttl] ⌚ "syncstorage.purge_ttl.batches_duration:178386|ms"
[2020-04-01T04:13:05Z INFO  purge_ttl] ⌚ "syncstorage.purge_ttl.bso_duration:605017|ms"
[2020-04-01T04:13:05Z INFO  purge_ttl] ⌚ "syncstorage.purge_ttl.total_duration:783403|ms"
Error: RpcFailure(RpcStatus { status: RpcStatusCode(10), details: Some("Aborted due to transient fault") })

The Python version's error is different:

{"datetime": "2020-03-20 04:22:57,559", "message": "Starting purge_ttl.py"}
{"datetime": "2020-03-20 04:22:57,559", "message": "For sync:syncdb4"}
{"datetime": "2020-03-20 04:24:02,090", "message": "batches: removed 587 rows, batches_duration: 0:01:04.530411"}
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/google/api_core/grpc_helpers.py", line 79, in next
    return six.next(self._wrapped)
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1584681841.118465835","description":"Error received from peer ipv4:74.125.20.95:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/tools/spanner/purge_ttl.py", line 76, in <module>
    spanner_read_data()
  File "/app/tools/spanner/purge_ttl.py", line 65, in spanner_read_data
    result = database.execute_partitioned_dml(query)
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/spanner_v1/database.py", line 395, in execute_partitioned_dml
    list(result_set)  # consume all partials
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/spanner_v1/streamed.py", line 143, in __iter__
    self._consume_next()
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/spanner_v1/streamed.py", line 116, in _consume_next
    response = six.next(self._response_iterator)
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/spanner_v1/snapshot.py", line 45, in _restart_on_unavailable
    for item in iterator:
  File "/usr/local/lib/python3.7/dist-packages/google/api_core/grpc_helpers.py", line 81, in next
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded

According to https://github.com/grpc/grpc/blob/master/doc/statuscodes.md, it's possible the call completed successfully despite the DEADLINE_EXCEEDED error. There's also a distinction between timeouts and deadlines: https://grpc.io/blog/deadlines/ -- I know the Python client configures a timeout, but I wonder if it ends up with an inappropriate deadline for this call.
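
To make that concrete: from the traceback, the purge boils down to one execute_partitioned_dml call per table, and the gRPC deadline applies to that single long-running call. A rough sketch of the Python side (a sketch only; the DELETE text is assumed, modeled on the expiry check used in the row-count query later in this bug):

from google.cloud import spanner

client = spanner.Client()
database = client.instance("sync").database("syncdb4")

# Partitioned DML is planned and executed server-side in partitions, so a
# single call can legitimately run for many minutes on a large table. If the
# per-call deadline is shorter than that, the client raises DEADLINE_EXCEEDED
# even though Spanner may keep running (or have finished) the work.
row_count = database.execute_partitioned_dml(
    "DELETE FROM bsos WHERE expiry < CURRENT_TIMESTAMP()"
)
print(f"bsos: removed {row_count} rows")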

As for the Rust version, gRPC clients tend to have retry logic -- and ABORTED signals that the call failed "typically due to a concurrency issue", indicating the client should retry some number of times. I'm not sure our low-level Rust library (essentially a thin wrapper over the C++ grpcio internals) does any retrying for us.
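
On the Python side, at least, an ABORTED response can be retried with the helper in google.api_core; a minimal sketch (the backoff values here are my own guesses, not what the client library ships with):

from google.api_core import exceptions
from google.api_core.retry import Retry, if_exception_type

retry_aborted = Retry(
    predicate=if_exception_type(exceptions.Aborted),
    initial=1.0,      # first backoff, in seconds
    maximum=32.0,     # cap on the backoff
    multiplier=2.0,   # exponential growth factor
    deadline=3600.0,  # give up after an hour of retrying
)

def purge(database, query):
    # Re-issues the partitioned DML whenever it fails with 409 Aborted.
    return retry_aborted(database.execute_partitioned_dml)(query)

Whether the Rust/grpcio stack does anything equivalent is exactly the open question.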

I suspect the Rust version is not completing the deletions. The job controller runs one retry of the entire pod (which runs the script again) when the first run ends in failure; that retry also fails each time.

Priority: -- → P1
Assignee: nobody → dpreston

GCP support said they recently fixed a similar bug in the Java client and added the fix to the Python client in v1.15.1:
https://github.com/googleapis/python-spanner/pull/49

It is a gRPC keepalive setting. I built an image with the new Python client and tested it in stage and production. In stage it worked normally; in prod it now fails at the 10-minute mark with a new error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 96, in next
    return six.next(self._wrapped)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.ABORTED
	details = "Aborted due to transient fault"
	debug_error_string = "{"created":"@1586958994.625062056","description":"Error received from peer ipv4:74.125.197.95:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Aborted due to transient fault","grpc_status":10}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/purge_ttl.py", line 78, in <module>
    spanner_read_data()
  File "/app/purge_ttl.py", line 67, in spanner_read_data
    result = database.execute_partitioned_dml(query)
  File "/usr/local/lib/python3.7/site-packages/google/cloud/spanner_v1/database.py", line 463, in execute_partitioned_dml
    list(result_set)  # consume all partials
  File "/usr/local/lib/python3.7/site-packages/google/cloud/spanner_v1/streamed.py", line 143, in __iter__
    self._consume_next()
  File "/usr/local/lib/python3.7/site-packages/google/cloud/spanner_v1/streamed.py", line 116, in _consume_next
    response = six.next(self._response_iterator)
  File "/usr/local/lib/python3.7/site-packages/google/cloud/spanner_v1/snapshot.py", line 46, in _restart_on_unavailable
    for item in iterator:
  File "/usr/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 99, in next
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.Aborted: 409 Aborted due to transient fault
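
For reference, the keepalive fix boils down to passing gRPC channel options when the Spanner channel is created; something along these lines (the option values here are illustrative guesses, not the exact ones from the PR):

import grpc

keepalive_options = [
    ("grpc.keepalive_time_ms", 30000),           # send a keepalive ping every 30s
    ("grpc.keepalive_timeout_ms", 10000),        # wait up to 10s for the ping ack
    ("grpc.keepalive_permit_without_calls", 1),  # allow pings with no active calls
]
channel = grpc.secure_channel(
    "spanner.googleapis.com:443",
    grpc.ssl_channel_credentials(),
    options=keepalive_options,
)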

Testing locally with a new Rust version created by fzzzy.

The new version does a SELECT with a LIMIT to find purgeable records, then does a batch delete of all records found by the select.
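
The same pattern, sketched in Python against the Spanner client just to show its shape (the real script is Rust, and the bsos key columns below are a guess at the schema):

from google.cloud import spanner

client = spanner.Client()
database = client.instance("sync").database("syncdb4")

# Step 1: read a bounded chunk of expired primary keys.
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT fxa_uid, fxa_kid, collection_id, bso_id FROM bsos "
        "WHERE expiry < CURRENT_TIMESTAMP() LIMIT @chunk",
        params={"chunk": 1000},
        param_types={"chunk": spanner.param_types.INT64},
    )
    keys = [list(row) for row in results]

# Step 2: delete exactly those rows with a single batch mutation, which stays
# bounded in size because the chunk is bounded.
if keys:
    with database.batch() as batch:
        batch.delete("bsos", spanner.KeySet(keys=keys))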

The new Rust purge script was tested in stage. Waiting for validation/testing before running it in production.

Flags: needinfo?(rtublitz)
Flags: needinfo?(rtublitz)

Whoops - didn't mean to remove the ni? from myself. I'll keep it open until this is rolling. Currently working with rbillings and fzzzy to coordinate details; doc lives here.

Flags: needinfo?(rtublitz)

Rebecca, I know you were working on getting testing rolling here: did we get that done last week, or is this still in progress?

Flags: needinfo?(rtublitz) → needinfo?(rbillings)

Testing was completed 4/23 and was successful.

Flags: needinfo?(rbillings)
Assignee: dpreston → jrconlin
Status: NEW → ASSIGNED

The new Python Spanner module has implemented retry of partitioned DML queries. It finally seems to be working the way it's supposed to (or the way it was described as working a few months ago).

The cron has been running overnight since Thursday/Friday night (5/15). The purge on the bsos table errors out at 3600s. The Kubernetes job is configured to retry the job 2 more times. Because the jobs have so far always ended in error, we can't tell how many rows are purged on each run. We have been occasionally running a very expensive query to count all the expired rows, and the count is coming down:

5/8 - 117 million
5/16 - 20 million
5/17 - 9 million

Once a row is purged, it can take up to 48 hours for the data on disk to be removed from the cluster. We can infer that a few hundred gigabytes of data have been removed, as the total bytes metric has remained flat for several days (image attached showing 30 days of Spanner disk usage).

I expected it to reach zero overnight last night, but it did not. Presumably once it reaches zero, a job will complete successfully and log the number of bsos rows it purged.

I ran the row count query this evening: select count(*) from bsos where expiry < CURRENT_TIMESTAMP() and got:
46,064,326

Now I am quite confused!

This morning it is down to 9,586,779 again.

The Rust version is not used; the Python version is. Closing as INVALID.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → INVALID
Product: Cloud Services → Cloud Services Graveyard