Bug 1599460 (closed) - cycle_data for jobs failing on stage and prototype

Category: Tree Management :: Treeherder: Infrastructure
Type: defect
Severity: normal
Status: RESOLVED FIXED
Reporter: armenzg
Assignee: ekyle
Blocks: 1 open bug

It seems that deleting jobs can easily fail.
I believe the issue is that we filter() on submit_time__lt, which is not indexed, so the query keeps scanning until it finds enough rows to meet the "limit" set by the chunk size.

Thoughts on how to improve this?

Here's a snippet of code that can help you go through the same code path:

from treeherder.model.models import Job
import datetime

# grab a small chunk of jobs older than 120 days, the same way cycle_data does
chunk_size = 5
jobs_max_timestamp = datetime.datetime.now() - datetime.timedelta(days=120)
jobs_chunk = list(
    Job.objects.filter(submit_time__lt=jobs_max_timestamp)
    .order_by('id')
    .values_list('guid', flat=True)[:chunk_size]
)
# delete the chunk by guid
Job.objects.filter(guid__in=jobs_chunk).delete()
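If you want to see the SQL the ORM generates for that chunk query, Django exposes it on the queryset (a minimal sketch; the exact SQL text depends on the Django version):

from treeherder.model.models import Job
import datetime

jobs_max_timestamp = datetime.datetime.now() - datetime.timedelta(days=120)
qs = (
    Job.objects.filter(submit_time__lt=jobs_max_timestamp)
    .order_by('id')
    .values_list('guid', flat=True)[:5]
)
# prints roughly: SELECT guid FROM job WHERE submit_time < ... ORDER BY id LIMIT 5
print(qs.query)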

I ran this statement and it loses the connection:

SELECT * FROM treeherder.job WHERE submit_time < '2019-07-29 04:51:16.000000' ORDER BY id LIMIT 5000;

If I run this instead, it goes fast because there are enough matching rows to satisfy the LIMIT:

SELECT * FROM treeherder.job WHERE submit_time < '2019-07-29 06:51:16.000000' ORDER BY id LIMIT 5000;

Notice that the WHERE clauses only differ by 2 hours. The first statement fails because it keeps doing a full table scan for rows that meet the criteria, but there are never enough to satisfy the LIMIT, so it can never succeed.
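One way to confirm this is to look at the execution plan. Here's a minimal sketch using Django's raw cursor (assuming MySQL's EXPLAIN output; the table name and timestamp are copied from the queries above):

# sketch: inspect the execution plan of the slow query (assumes MySQL)
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "EXPLAIN SELECT * FROM job"
        " WHERE submit_time < %s ORDER BY id LIMIT 5000",
        ['2019-07-29 04:51:16'],
    )
    # a join type of ALL with a huge row estimate indicates a full table scan
    for row in cursor.fetchall():
        print(row)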

# Output modified to simplify conversation
# An iteration
7:03:19 [treeherder.model.models:426] Pruning jobs: chunk of 5000 older than Jul 29 2019
7:03:19 [treeherder.model.models:436] deleting FailureLines
7:03:23 [treeherder.model.models:442] delete jobs
# Another iteration only 8 seconds after starting to delete jobs
7:03:31 [treeherder.model.models:426] Pruning jobs: chunk of 5000 older than Jul 29 2019
            [treeherder.model.models:436] deleting FailureLines
            [treeherder.model.models:442] delete jobs
# One last iteration which fails after 3 minutes
7:03:43 treeherder-stage app/scheduler.9126 WARNING:delete jobs
7:06:51 [treeherder.model.management.commands.cycle_data:51] Error running cycle_data: (3024, 'Query execution was interrupted, maximum statement execution time exceeded') 

On another note, we're not cleaning up old pushes at all. I see pushes from 2015.

If we decoupled Performance data from pushes, we could simply iterate through pushes older than 4 months and do cascade removal; a sketch of that follows.
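A minimal sketch of what that cleanup could look like, assuming the Push model exposes its timestamp as a time field and that the relevant foreign keys to Push cascade on delete:

# hypothetical sketch of chunked push cleanup; assumes FKs to Push cascade on delete
import datetime
from treeherder.model.models import Push

cutoff = datetime.datetime.now() - datetime.timedelta(days=120)
while True:
    old_push_ids = list(
        Push.objects.filter(time__lt=cutoff)
        .order_by('id')
        .values_list('id', flat=True)[:1000]
    )
    if not old_push_ids:
        break
    Push.objects.filter(id__in=old_push_ids).delete()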

This ties into the discussion of how long to keep alerts from the PerformanceAlertSummary table, which has foreign keys on the Push table (and whose data we aren't currently deleting at all). PerformanceDatum also has a foreign key on Push, and we keep that data for a year. Perhaps we should file a new bug for deleting old push data and needinfo Dave and Ionut so we can decide on a definitive data retention policy.

It's good that we're logging this type of error and finally seeing it, thanks to the recent changes to cycle_data in bug 1597136. Maybe :camd has some ideas about the issue you mentioned above, since he authored the recent improvements.

Sarah, could you please file that bug, since you have the most context? Thanks!

ekyle, camd: Could you please weigh in on this bug?

ekyle, I'm specifically interested in whether you have a solution for the fact that using LIMIT causes a full table scan when there are fewer matching rows at the beginning of the table than the specified limit.

Thank you all!

Flags: needinfo?(klahnakoski)
Flags: needinfo?(cdawson)
Blocks: 1599095

(In reply to Armen [:armenzg] from comment #3)
> Sarah, could you please file that bug, since you have the most context?
> Thanks!

Filed bug 1599859.

I merged https://github.com/mozilla/treeherder/pull/5669 a while ago, which is related to this. It did some optimization there, but Kyle had some good ideas on how to optimize the queries further so that they make use of the index that contains the submit_time field.

He was going to do some of this work in Bug 1597368.
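For illustration, an index on that field could be added with a Django migration along these lines (a hypothetical sketch on my part; the actual index shape is being worked out in bug 1597368, and the dependency entry is a placeholder):

# hypothetical sketch: a migration adding an index covering submit_time
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [('model', '0001_initial')]  # placeholder dependency
    operations = [
        migrations.AddIndex(
            model_name='job',
            index=models.Index(fields=['submit_time'], name='job_submit_time_idx'),
        ),
    ]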

Flags: needinfo?(cdawson)

This is a known problem with https://github.com/mozilla/treeherder/pull/5669; it is failing, and being reported, as expected.

The lack of an index forces a full table scan: if there are more matching records than LIMIT, the query is probably fast because old records are scanned first. If the number of records to delete is less than LIMIT, the whole table is scanned, which takes too long.
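One way around that (a sketch of the general idea, not necessarily the fix that will land in bug 1597368) is to walk the table in bounded primary-key ranges, so each query touches a fixed slice of rows no matter how few of them match:

# sketch: delete in bounded id ranges instead of relying on LIMIT
import datetime
from treeherder.model.models import Job

step = 5000
cutoff = datetime.datetime.now() - datetime.timedelta(days=120)
max_id = Job.objects.order_by('-id').values_list('id', flat=True).first() or 0
last_id = 0
while last_id < max_id:
    # the range on id (the primary key) keeps every query bounded,
    # even when no rows in the range satisfy the submit_time filter
    guids = list(
        Job.objects.filter(id__gt=last_id, id__lte=last_id + step,
                           submit_time__lt=cutoff)
        .values_list('guid', flat=True)
    )
    if guids:
        Job.objects.filter(guid__in=guids).delete()
    last_id += step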

camd added the try/catch so this error would show itself in the logs, and we are aware of the failure.

Flags: needinfo?(klahnakoski)
Assignee: nobody → klahnakoski
See Also: → 1597368

Thanks for looking into this!

I think this gets fixed by bug 1597368.
Would you agree?

Yes, I agree.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED