Closed Bug 1347721 Opened 8 years ago Closed 6 years ago

Data in the FailureLine table isn't being expired

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The first row in the table has a created date of 2016-02-01 04:07:28. The table is 86GB, 12.7 million rows. We really should hook it up to cycle_data.
Mentor: cdawson
Keywords: good-first-bug
Whiteboard: [lang=py]
Hi! Can I work on this?
I am new here and I want to work on this bug. Could you give me some more technical information about it?
Flags: needinfo?(cdawson)
Hi there ydidwania -- Yes! :)

We have a process called "Cycle Data" that should be doing this work. Please look in the file: https://github.com/mozilla/treeherder/blob/master/treeherder/model/models.py#L611-L611

The code here SHOULD be deleting those FailureLines that don't have an associated job_guid value. So the issue here is we need to debug why they're not getting deleted.

I'd suggest starting by writing a unit test to ensure that the records can be deleted there. The unit tests for this feature are here: https://github.com/mozilla/treeherder/blob/master/tests/model/test_cycle_data.py#L23-L23 It doesn't look like there's one for checking that FailureLines are deleted.

If you get blocked, please let me know if there's anything I can do to help. :)

-Cam
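A minimal sketch of what such a test might look like (the fixture names test_job and failure_lines are hypothetical stand-ins for whatever tests/model/test_cycle_data.py already provides, and the fixtures are assumed to create a job old enough to be expired):

from django.core.management import call_command

from treeherder.model.models import FailureLine


def test_cycle_data_removes_expired_failure_lines(test_job, failure_lines):
    # Sanity check: the (hypothetical) fixtures created FailureLine rows
    # tied to test_job, and test_job falls outside the retention window.
    assert FailureLine.objects.filter(job_guid=test_job.guid).exists()

    # Run the cycle_data management command, which should expire the job
    # and its failure lines together.
    call_command('cycle_data')

    assert not FailureLine.objects.filter(job_guid=test_job.guid).exists()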
Flags: needinfo?(cdawson)
Bug 1347721. Here's the test to ensure FailureLines can be deleted: https://pastebin.com/x7gyZDEu. It ran successfully. What should I do next? Also, could you tell me about ElasticSearch? It is used in the cycle data process.
Flags: needinfo?(cdawson)
I read the cycle_data function in https://github.com/mozilla/treeherder/blob/master/treeherder/model/models.py#L611-L611. I didn't understand what ELASTIC_SEARCH is. Apart from that, it filters Jobs whose job_max_timestamp is beyond a certain cycle_interval, deletes the FailureLines that have a job_guid associated with those jobs, and then deletes the jobs as well.
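For anyone following along, a rough Python paraphrase of the flow described above (this is not the actual implementation; the field names submit_time/guid/job_guid and the 120-day window are assumptions):

from datetime import timedelta

from django.utils import timezone

from treeherder.model.models import FailureLine, Job

# Assumed retention window; the real cycle_interval is configurable.
cutoff = timezone.now() - timedelta(days=120)

# Jobs older than the cycle interval...
expired_jobs = Job.objects.filter(submit_time__lt=cutoff)
expired_guids = list(expired_jobs.values_list('guid', flat=True))

# ...have their FailureLines deleted by job_guid, then the jobs themselves.
FailureLine.objects.filter(job_guid__in=expired_guids).delete()
expired_jobs.delete()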
It turns out this was a poor choice for a good first bug -- my bad. This is more of a data issue that requires being able to query the DB directly in order to ascertain what is really going on.

As it happens, the FailureLine records are only deleted when their corresponding Job is deleted. But there are LOTS of records that no longer have a Job (orphans). If I query for failure lines newer than 2017-01-01, the earliest is 2017-04-05. So I believe we are deleting the newly-expiring FailureLines along with their corresponding jobs just fine; we really just need to delete the orphaned records. The JobLog table has the same issue.

That being said, I'll remove the "good first bug" tag and propose a couple of DB queries that will delete the orphaned records:

DELETE FROM failure_line fl LEFT JOIN job ON job.guid=fl.job_guid WHERE job.guid is null;

and

DELETE FROM treeherder.job_log LEFT JOIN job ON job_log.job_id=job.id WHERE job.id is null;

Ed, would you double-check me on the SQL here? :)
Flags: needinfo?(cdawson) → needinfo?(emorley)
Mentor: cdawson
Keywords: good-first-bug
Whiteboard: [lang=py]
I think the reason for the orphaned rows might be the various bugs in cycle data, some of which were fixed a while back. James, would you mind taking a look at the SQL to delete them above?
Flags: needinfo?(emorley) → needinfo?(james)
Unless there's some SQL syntax I don't know (there is), you probably mean `DELETE FROM failure_line as fl`. Apart from that it looks reasonable, but you should convert the queries into SELECTs first and check that the rows you get back look sensible, e.g.

SELECT fl.* FROM failure_line as fl LEFT JOIN job ON job.guid=fl.job_guid WHERE job.guid is null ORDER BY id DESC LIMIT 100;

(selecting the highest ids should ensure that anything you don't inspect is even older, and so also has its job deleted).

Should cleaning up such orphans be a regular part of cycle_data? It would ordinarily be a no-op, but it seems reasonable to assume that some earlier cleanup may have been incomplete, since IIRC it's not done atomically.
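If it's easier to run from a Django shell than from a MySQL prompt, the same verification step could be done through Django's DB connection (a sketch only; the query text just mirrors the SELECT above):

from django.db import connection

VERIFY_ORPHANS_SQL = """
    SELECT fl.id, fl.job_guid
    FROM failure_line AS fl
    LEFT JOIN job ON job.guid = fl.job_guid
    WHERE job.guid IS NULL
    ORDER BY fl.id DESC
    LIMIT 100
"""

with connection.cursor() as cursor:
    cursor.execute(VERIFY_ORPHANS_SQL)
    for failure_line_id, job_guid in cursor.fetchall():
        # Inspect the newest orphans; anything not listed is older still.
        print(failure_line_id, job_guid)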
Flags: needinfo?(james)
Component: Treeherder → Treeherder: Infrastructure
(In reply to Ed Morley [:emorley] from comment #0)
> The table is 86GB, 12.7 million rows.

In the 8 months since the OP, the failure_line table has now increased to 77 million rows and 320GB.
I'm going to work on this as part of fixing bug 1409679. The GraphQL queries are running slowly at times, and they spend a lot of time in this table. In order to diagnose the slowness, I'd like to start by getting this table under control.

Unfortunately, running a raw SQL query to delete the data won't quite cut it, because the ON DELETE setting for tables like failure_match is "restrict" rather than "cascade"; the cascading happens in the Django code instead of in the DB. So I think the best solution at this point is to create a management command that will do this. I'll use a chunk of raw SQL to get the job guids that are missing
Blocks: 1409679
Oops, accidental save... continued:

I'll use a chunk of raw SQL to get the job guids that are missing from the ``job`` table (in CHUNK_SIZE chunks), then do the delete on the FailureLine model for those. We can set an interval to run this (and increase the chunk size as seems appropriate). This seems like a good candidate for trying the Heroku scheduler. There are plenty of offending records on prototype that we can test with. :)
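A rough sketch of what such a management command could look like (the command name, chunk size, and raw query are illustrative assumptions, not the contents of the eventual PR):

# Hypothetical location: treeherder/model/management/commands/delete_orphaned_failure_lines.py
from django.core.management.base import BaseCommand
from django.db import connection

from treeherder.model.models import FailureLine

CHUNK_SIZE = 1000

# Raw SQL to find job guids referenced by failure_line but missing from job.
ORPHAN_GUID_SQL = """
    SELECT DISTINCT fl.job_guid
    FROM failure_line fl
    LEFT JOIN job ON job.guid = fl.job_guid
    WHERE job.guid IS NULL
    LIMIT %s
"""


class Command(BaseCommand):
    help = "Delete FailureLine rows whose job no longer exists"

    def handle(self, *args, **options):
        while True:
            with connection.cursor() as cursor:
                cursor.execute(ORPHAN_GUID_SQL, [CHUNK_SIZE])
                guids = [row[0] for row in cursor.fetchall()]

            if not guids:
                break

            # Delete via the ORM so Django cascades to dependent models
            # (e.g. failure_match), which the DB itself only RESTRICTs on.
            deleted, _ = FailureLine.objects.filter(job_guid__in=guids).delete()
            self.stdout.write("Deleted %d failure_line rows" % deleted)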
Attachment #8935958 - Flags: review?(emorley)
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Comment on attachment 8935958 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3032

Sorry for the delay reviewing this! Over the vacation I manually ran cycle_data, which helped clean up some of these and possibly changes things slightly. I've left some comments on the PR :-)
Attachment #8935958 - Flags: review?(emorley)

I'm not going to get to this anytime soon.

Assignee: cdawson → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3

After taking another look, the oldest records in this table are now 4 months old, which coincides with the cycle_data age for jobs, etc. So cycle_data is now working properly for this table. Closing.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
