Closed Bug 1749309 Opened 4 years ago Closed 4 years ago

cleanse_upload takes too long to run

Categories

(Tecken :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

Attachments

(1 file)

In bug #1638412, I wrote a cleanse_upload command that cleaned out several tables that accumulate data about uploads, but didn't expire the data in the table when the corresponding uploads expire out of the S3 bucket.

Cool.

Except we can't run the command because it takes too long to run.

This bug covers redoing the command to make it possible to run or figuring out some other way to expire the data in the table.

willkg merged PR #2471: "bug 1749309: rework cleanse_upload to skip SELECTs" in 53d71e1.

We'll test this in stage, but it's not clear how helpful that'll be given stage has very little data.

This deployed to stage.

Jason: Can you run this in stage?:

$ pwd
/app
$ time ./manage.py cleanse_upload --dry-run
<output>
$ time ./manage.py cleanse_upload
<output>

Then copy and paste the output (it shouldn't have any secrets) here. Thanks!

Flags: needinfo?(jthomas)

See outputs below:

app@ip-172-31-0-33:/app$ time ./manage.py cleanse_upload --dry-run
{"Timestamp": 1641917185890163968, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 23, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641917185890280448, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 23, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641917185892795392, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 23, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641917185892888832, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 23, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641917185892934912, "Type": "markus.backends.datadog", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 23, "Fields": {"msg": "DatadogMetrics configured: localhost:8125 "}}
{"Timestamp": 1641917186064189952, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 23, "Fields": {"msg": "Found credentials in environment variables."}}
{"Timestamp": 1641917186165501696, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 23, "Fields": {"msg": "Found credentials in environment variables."}}
cleanse_upload: THIS IS A DRY RUN.
cleanse_upload: count before cleansing: upload=194, fileupload=4912
cleanse_upload: try=True, cutoff=2021-12-12: deleted upload=0, fileupload=0
cleanse_upload: try=False, cutoff=2020-01-12: deleted upload=51, fileupload=4428

real	0m1.428s
user	0m0.867s
sys	0m0.096s
app@ip-172-31-0-33:/app$  time ./manage.py cleanse_upload
{"Timestamp": 1641917196316438784, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 24, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641917196316556800, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 24, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641917196319038464, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 24, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641917196319123968, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 24, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641917196319173632, "Type": "markus.backends.datadog", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 24, "Fields": {"msg": "DatadogMetrics configured: localhost:8125 "}}
{"Timestamp": 1641917196479342848, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 24, "Fields": {"msg": "Found credentials in environment variables."}}
{"Timestamp": 1641917196579687168, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 24, "Fields": {"msg": "Found credentials in environment variables."}}
cleanse_upload: count before cleansing: upload=194, fileupload=4912
cleanse_upload: try=True, cutoff=2021-12-12: deleted upload=0, fileupload=0
cleanse_upload: try=False, cutoff=2020-01-12: deleted upload=51, fileupload=4428

real	0m2.034s
user	0m0.860s
sys	0m0.098s
Flags: needinfo?(jthomas)

If you run the second command again, what does it print out?

$ time ./manage.py cleanse_upload
Flags: needinfo?(jthomas)
app@ip-172-31-0-33:/app$ time ./manage.py cleanse_upload
{"Timestamp": 1641922705935144448, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 25, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641922705935268096, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 25, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641922705937828864, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 25, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641922705937921024, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 25, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641922705937966336, "Type": "markus.backends.datadog", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 25, "Fields": {"msg": "DatadogMetrics configured: localhost:8125 "}}
{"Timestamp": 1641922706128034048, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 25, "Fields": {"msg": "Found credentials in environment variables."}}
{"Timestamp": 1641922706228626688, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-0-33.ec2.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 25, "Fields": {"msg": "Found credentials in environment variables."}}
cleanse_upload: count before cleansing: upload=143, fileupload=484
cleanse_upload: try=True, cutoff=2021-12-12: deleted upload=0, fileupload=0
cleanse_upload: try=False, cutoff=2020-01-12: deleted upload=0, fileupload=0

real	0m1.370s
user	0m0.900s
sys	0m0.099s
Flags: needinfo?(jthomas)

Awesome! It's generating two DELETE queries for try=True (file_upload and then upload) and for try=False (file_upload and then upload), so that'll take as long as the manual query we did in bug #1638412. It's definitely running and working and is way easier to run than the hand-crafted SQL.

Done on prod:

app@ip-172-31-28-151:/app$ time ./manage.py cleanse_upload --dry-run
{"Timestamp": 1641935510346351104, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 127, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641935510346476288, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 127, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641935510354401792, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 127, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641935510354484480, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 127, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641935510354526464, "Type": "markus.backends.datadog", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 127, "Fields": {"msg": "DatadogMetrics configured: localhost:8125 "}}
{"Timestamp": 1641935510865625600, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 127, "Fields": {"msg": "Found credentials in environment variables."}}
{"Timestamp": 1641935510973184512, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 127, "Fields": {"msg": "Found credentials in environment variables."}}
cleanse_upload: THIS IS A DRY RUN.
cleanse_upload: count before cleansing: upload=257975, fileupload=22135370
cleanse_upload: try=True, cutoff=2021-12-12: deleted upload=54182, fileupload=9962815
cleanse_upload: try=False, cutoff=2020-01-12: deleted upload=67740, fileupload=4153589

real	4m49.196s
user	0m0.889s
sys	0m0.156s

app@ip-172-31-28-151:/app$ time ./manage.py cleanse_upload
{"Timestamp": 1641936085484666368, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 134, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641936085484787968, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 134, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641936085487146240, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 134, "Fields": {"msg": "Statsd buffering is disabled"}}
{"Timestamp": 1641936085487227648, "Type": "datadog.dogstatsd", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 134, "Fields": {"msg": "Statsd periodic buffer flush is disabled"}}
{"Timestamp": 1641936085487270144, "Type": "markus.backends.datadog", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 134, "Fields": {"msg": "DatadogMetrics configured: localhost:8125 "}}
{"Timestamp": 1641936085718197504, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 134, "Fields": {"msg": "Found credentials in environment variables."}}
{"Timestamp": 1641936085814236416, "Type": "botocore.credentials", "Logger": "tecken", "Hostname": "ip-172-31-28-151.us-west-2.compute.internal", "EnvVersion": "2.0", "Severity": 6, "Pid": 134, "Fields": {"msg": "Found credentials in environment variables."}}
cleanse_upload: count before cleansing: upload=257977, fileupload=22135445
cleanse_upload: try=True, cutoff=2021-12-12: deleted upload=54182, fileupload=9962815
cleanse_upload: try=False, cutoff=2020-01-12: deleted upload=67740, fileupload=4153589

real	11m8.257s
user	0m0.840s
sys	0m0.101s

11m is awesome!

Quick stats:

table original deleted
uploads 257,977 121,922 (47%)
fileuploads 22,135,445 14,116,404 (63%)

That's so silly.

Marking this as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: