Closed Bug 1826417 Opened 2 years ago Closed 1 year ago

Investigate if expired artifacts are correctly removed

Categories

(Taskcluster :: Operations and Service Requests, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: marco, Assigned: yarik)

References

Details

Attachments

(1 file)

target.crashreporter-symbols-full.tar.zst was made to expire more quickly as part of bug 1790453, and a recent task confirms the shorter expiration: on https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KHq8rOLsSBmGjRh84vOVQw/artifacts this file expires at 2023-04-18T19:13:48.598Z instead of the usual 2024-04-03T19:13:48.598Z of the other artifacts.

Unfortunately, if we compare the data from June 2022 (https://console.cloud.google.com/bigquery?sq=586736774393:03952067b9d84e8e86ffb78792114d52) to March 2023 (https://console.cloud.google.com/bigquery?sq=513375772973:cc3c9c69c4774b739aa182e77e49ef64), the total size of these artifacts seems to have increased.

Is it possible we are not correctly removing expired artifacts?

Looking at this dashboard: https://earthangel-b40313e5.influxcloud.net/d/sowO93E7k/taskcluster-artifact-storage?orgId=1&from=now-90d&to=now we can see that the expireArtifacts job is running daily, but the total number of artifacts keeps growing.

There is a chance that this background task does not remove all of the artifacts during its run and they keep piling up.

Looking at the service logs with the following filter:

resource.type="k8s_container"
resource.labels.project_id="moz-fx-taskcluster-prod-4b87"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
jsonPayload.Type="monitor.periodic"
jsonPayload.Fields.name="expire-artifacts"

In the last 30 days there is not a single record, which means the expire-artifacts background job never finishes (a monitor.periodic record is only written when a run completes).

I don't have access to BigQuery, and I'm not sure whether it's possible to see what expiry dates those artifacts have, or to count the artifacts that should already have been expired and removed.

So I think we might have a bigger issue.
I managed to export artifact counts by expiry date from the db directly (the aggregation took about 30 minutes):

 select date_trunc('day', expires) as expires_day, count(*)
 from queue_artifacts
 group by date_trunc('day', expires);
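
For reference, a minimal sketch of how such an export could be scripted with node-postgres (the connection string, column alias, and output file name are my own choices, not what I actually ran):

  import { writeFileSync } from "fs";
  import { Client } from "pg";

  // Aggregate artifact counts by expiry day and dump them to a CSV file.
  async function exportExpiryHistogram(): Promise<void> {
    const db = new Client({ connectionString: process.env.DATABASE_URL });
    await db.connect();
    const { rows } = await db.query(
      `select date_trunc('day', expires) as expires_day, count(*) as total
       from queue_artifacts
       group by date_trunc('day', expires)
       order by expires_day`,
    );
    writeFileSync(
      "expiry-histogram.csv",
      rows.map((r) => `${r.expires_day.toISOString()},${r.total}`).join("\n"),
    );
    await db.end();
  }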

I also exported the results to Google Docs.

The numbers show that we should already have deleted 102,000,000 expired artifacts by today!
And for the rest of 2023 we'd expire an additional 180,000,000.

And based on the Grafana S3 delete-request metrics, we are doing somewhere between 600,000 and 900,000 delete requests per day.

So we are quite behind schedule and should consider a different expiration approach.

Looking at SQL queries and timings, it looks like we only manage about 9,000 queries a day, each returning 100 records; 9,000 × 100 = 900,000, which matches the average of 900,000 deleteObject requests a day. So we are likely limited by our db rather than by AWS.

Before we try something radical, I would first increase the page size and switch to s3.deleteObjects, which removes multiple objects (up to 1,000) in one call.
I will also add more logging to expire-artifacts so we can better track its progress.
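
For illustration, a minimal sketch of such a batched delete with the AWS SDK for JavaScript (v3 shown here; the v2 s3.deleteObjects call takes the same Bucket/Delete parameters). The client setup and the deleteBatch name are mine, not the service's actual code:

  import { S3Client, DeleteObjectsCommand } from "@aws-sdk/client-s3";

  const s3 = new S3Client({}); // region/credentials come from the environment

  // Delete up to 1,000 object keys with a single API call instead of
  // issuing one deleteObject request per key.
  async function deleteBatch(bucket: string, keys: string[]): Promise<void> {
    if (keys.length === 0 || keys.length > 1000) {
      throw new Error("deleteObjects accepts between 1 and 1,000 keys");
    }
    await s3.send(new DeleteObjectsCommand({
      Bucket: bucket,
      Delete: {
        Objects: keys.map((Key) => ({ Key })),
        Quiet: true, // report only per-key failures in the response
      },
    }));
  }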

I found a way to query the database more efficiently for our use case.

The existing expire-artifacts job was using the get_queue_artifacts_paginated function, which was originally meant for fetching the artifacts of a single task but was probably later extended to also find expired ones. It filters on four columns; three of them are in the primary index, but expires is not.

If we instead simply do

select * from queue_artifacts
where expires < '2023-01-01T00:00:00Z'
limit 1000;

then Postgres uses a single sequential scan and returns the first matching rows very quickly: 1,000 rows within 200-300 ms.

This is fast for queries like this one, where expires is in the past and matching rows are likely to appear early in the scan. It would be less efficient for finding records far in the future, but we don't need that at the moment.

Planned changes (a sketch of the resulting loop follows below):

  • use this query in the expire-artifacts job
  • increase the page size from 100 to 1000
  • add logging to see how many records are being processed
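
A rough sketch of what the updated loop could look like, assuming node-postgres. The primary-key columns (task_id, run_id, name), the S3 key layout, and the deleteBatch helper (the s3.deleteObjects sketch from the earlier comment) are assumptions for illustration, not the actual implementation:

  import { Client } from "pg";
  // deleteBatch is the s3.deleteObjects sketch shown earlier (hypothetical module)
  import { deleteBatch } from "./delete-batch";

  // Outline: page through expired rows, delete the S3 objects in one batched
  // call, then delete the rows so the next scan skips them.
  async function expireArtifacts(db: Client, bucket: string): Promise<void> {
    let total = 0;
    for (;;) {
      // Assumed primary-key columns: (task_id, run_id, name).
      const { rows } = await db.query(
        `select task_id, run_id, name
         from queue_artifacts
         where expires < now()
         limit 1000`,
      );
      if (rows.length === 0) break;

      // The <task_id>/<run_id>/<name> key layout is illustrative only.
      await deleteBatch(
        bucket,
        rows.map((r) => `${r.task_id}/${r.run_id}/${r.name}`),
      );

      // Remove the corresponding rows so the next page starts fresh.
      for (const r of rows) {
        await db.query(
          `delete from queue_artifacts
           where task_id = $1 and run_id = $2 and name = $3`,
          [r.task_id, r.run_id, r.name],
        );
      }

      total += rows.length;
      console.log(`expire-artifacts: removed ${total} artifacts so far`);
    }
  }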

After we deploy these changes we should be able to tell how long it will take to catch up with removals, and whether we still need to adjust the page size or the query itself.

The fix was merged in https://github.com/taskcluster/taskcluster/pull/6172
and released in v49.1.0.
It will be tested on the community deployment first.

The Community-TC deployment seems to be working fast.
Before the deployment the job was running for roughly 450-500 seconds; now it finishes in under 50 s while removing 20k-25k objects.

\o/

The updated expire-artifacts job removed the entire backlog of expired artifacts in just two days, which is great :)

It now finishes its daily run in 4-10 hours, removing somewhere around 1-2 million artifacts.

Yarik, can we close this as FIXED?

Assignee: nobody → ykurmyza
Flags: needinfo?(ykurmyza)

Yes, let's close it. Thanks for the reminder.

Status: NEW → RESOLVED
Closed: 1 year ago
Flags: needinfo?(ykurmyza)
Resolution: --- → FIXED
