Bug 1599460 (closed) - cycle_data for jobs failing on stage and prototype

Category: Tree Management :: Treeherder: Infrastructure
Type: defect
Severity: normal
Status: RESOLVED FIXED
Reporter: armenzg
Assignee: ekyle
Blocks: 1 open bug

It seems that deleting jobs can easily fail.
I believe the issue is that we filter() on submit_time__lt, which is not indexed, so the query keeps scanning until it finds enough rows to meet the "limit" set by the chunk size.

Thoughts on how to improve this?

Here's a snippet of code that can help you go through the same code path:

from treeherder.model.models import Job
import datetime

# grab a small chunk of jobs older than 120 days, the same way cycle_data does
chunk_size = 5
jobs_max_timestamp = datetime.datetime.now() - datetime.timedelta(days=120)
jobs_chunk = list(
    Job.objects.filter(submit_time__lt=jobs_max_timestamp)
    .order_by('id')
    .values_list('guid', flat=True)[:chunk_size]
)
# delete the chunk by guid
Job.objects.filter(guid__in=jobs_chunk).delete()
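If you want to see the SQL the ORM generates for that chunk query, Django exposes it on the queryset (a minimal sketch; the exact SQL text depends on the Django version):

from treeherder.model.models import Job
import datetime

jobs_max_timestamp = datetime.datetime.now() - datetime.timedelta(days=120)
qs = (
    Job.objects.filter(submit_time__lt=jobs_max_timestamp)
    .order_by('id')
    .values_list('guid', flat=True)[:5]
)
# prints roughly: SELECT guid FROM job WHERE submit_time < ... ORDER BY id LIMIT 5
print(qs.query)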

I ran this statement and it loses the connection:

SELECT * FROM treeherder.job WHERE submit_time < '2019-07-29 04:51:16.000000' ORDER BY id LIMIT 5000;

If I run this instead, it goes fast because there are enough matching rows to satisfy the LIMIT:

SELECT * FROM treeherder.job WHERE submit_time < '2019-07-29 06:51:16.000000' ORDER BY id LIMIT 5000;

Notice that the WHERE clauses only differ by 2 hours. The first statement fails because it keeps doing a full table scan for rows that meet the criteria, but there are never enough to satisfy the LIMIT, so it can never succeed.
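One way to confirm this is to look at the execution plan. Here's a minimal sketch using Django's raw cursor (assuming MySQL's EXPLAIN output; the table name and timestamp are copied from the queries above):

# sketch: inspect the execution plan of the slow query (assumes MySQL)
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "EXPLAIN SELECT * FROM job"
        " WHERE submit_time < %s ORDER BY id LIMIT 5000",
        ['2019-07-29 04:51:16'],
    )
    # a join type of ALL with a huge row estimate indicates a full table scan
    for row in cursor.fetchall():
        print(row)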

# Output modified to simplify conversation
# An iteration
7:03:19 [treeherder.model.models:426] Pruning jobs: chunk of 5000 older than Jul 29 2019
7:03:19 [treeherder.model.models:436] deleting FailureLines
7:03:23 [treeherder.model.models:442] delete jobs
# Another iteration only 8 seconds after starting to delete jobs
7:03:31 [treeherder.model.models:426] Pruning jobs: chunk of 5000 older than Jul 29 2019
            [treeherder.model.models:436] deleting FailureLines
            [treeherder.model.models:442] delete jobs
# One last iteration which fails after 3 minutes
7:03:43 treeherder-stage app/scheduler.9126 WARNING:delete jobs
7:06:51 [treeherder.model.management.commands.cycle_data:51] Error running cycle_data: (3024, 'Query execution was interrupted, maximum statement execution time exceeded') 

On another note, we're not cleaning up old pushes at all. I see pushes from 2015.

If we decoupled Performance data from pushes, we could simply iterate through pushes older than 4 months and do cascade removal; a sketch of that follows.
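A minimal sketch of what that cleanup could look like, assuming the Push model exposes its timestamp as a time field and that the relevant foreign keys to Push cascade on delete:

# hypothetical sketch of chunked push cleanup; assumes FKs to Push cascade on delete
import datetime
from treeherder.model.models import Push

cutoff = datetime.datetime.now() - datetime.timedelta(days=120)
while True:
    old_push_ids = list(
        Push.objects.filter(time__lt=cutoff)
        .order_by('id')
        .values_list('id', flat=True)[:1000]
    )
    if not old_push_ids:
        break
    Push.objects.filter(id__in=old_push_ids).delete()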

This ties into the discussion of how long to keep alerts from the PerformanceAlertSummary table, which has foreign keys on the Push table (and whose data we aren't currently deleting at all). PerformanceDatum also has a foreign key on Push, and we keep that data for a year. Perhaps we should file a new bug for deleting old push data and needinfo Dave and Ionut so we can decide on a definitive data retention policy.

It's good that we're logging this type of error and finally seeing it, thanks to the recent changes to cycle_data in bug 1597136. Maybe :camd has some ideas about the issue you mentioned above, since he authored the recent improvements.

Sarah, could you please file that bug, since you have the most context? Thanks!

ekyle, camd: Could you please weigh in on this bug?

ekyle, I'm specifically interested in whether you have a solution for the fact that using LIMIT causes a full table scan when there are fewer matching rows at the beginning of the table than the specified limit.

Thank you all!

Flags: needinfo?(klahnakoski)
Flags: needinfo?(cdawson)
Blocks: 1599095

(In reply to Armen [:armenzg] from comment #3)
> Sarah, could you please file that bug, since you have the most context?
> Thanks!

Filed bug 1599859.

I merged https://github.com/mozilla/treeherder/pull/5669 a while ago, which is related to this. It did some optimization there, but Kyle had some good ideas on how to optimize the queries further so that they make use of the index that contains the submit_time field.

He was going to do some of this work in Bug 1597368.
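For illustration, an index on that field could be added with a Django migration along these lines (a hypothetical sketch on my part; the actual index shape is being worked out in bug 1597368, and the dependency entry is a placeholder):

# hypothetical sketch: a migration adding an index covering submit_time
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [('model', '0001_initial')]  # placeholder dependency
    operations = [
        migrations.AddIndex(
            model_name='job',
            index=models.Index(fields=['submit_time'], name='job_submit_time_idx'),
        ),
    ]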

Flags: needinfo?(cdawson)

This is a known problem with https://github.com/mozilla/treeherder/pull/5669; it is failing, and being reported, as expected.

The lack of an index forces a full table scan: if there are more matching records than LIMIT, the query is probably fast because old records are scanned first. If the number of records to delete is less than LIMIT, the whole table is scanned, which takes too long.
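One way around that (a sketch of the general idea, not necessarily the fix that will land in bug 1597368) is to walk the table in bounded primary-key ranges, so each query touches a fixed slice of rows no matter how few of them match:

# sketch: delete in bounded id ranges instead of relying on LIMIT
import datetime
from treeherder.model.models import Job

step = 5000
cutoff = datetime.datetime.now() - datetime.timedelta(days=120)
max_id = Job.objects.order_by('-id').values_list('id', flat=True).first() or 0
last_id = 0
while last_id < max_id:
    # the range on id (the primary key) keeps every query bounded,
    # even when no rows in the range satisfy the submit_time filter
    guids = list(
        Job.objects.filter(id__gt=last_id, id__lte=last_id + step,
                           submit_time__lt=cutoff)
        .values_list('guid', flat=True)
    )
    if guids:
        Job.objects.filter(guid__in=guids).delete()
    last_id += step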

camd added the try/catch so this error would show itself in the logs, and we are aware of the failure.

Flags: needinfo?(klahnakoski)
Assignee: nobody → klahnakoski
See Also: → 1597368

Thanks for looking into this!

I think this gets fixed by bug 1597368.
Would you agree?

Yes, I agree.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED