Closed Bug 1347721 Opened 8 years ago Closed 6 years ago

Data in the FailureLine table isn't being expired

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The first row in the table has a created date of 2016-02-01 04:07:28. The table is 86GB, 12.7 million rows. We really should hook it up to cycle_data.
Mentor: cdawson
Keywords: good-first-bug
Whiteboard: [lang=py]
Hi! Can I work on this?
I am new here and I want to work on this bug. Could you give me some more technical information about it?
Flags: needinfo?(cdawson)
Hi there ydidwania -- Yes! :)

We have a process called "Cycle Data" that should be doing this work. Please look in the file: https://github.com/mozilla/treeherder/blob/master/treeherder/model/models.py#L611-L611

The code here SHOULD be deleting those FailureLines that don't have an associated job_guid value. So the issue here is we need to debug why they're not getting deleted.

I'd suggest starting by writing a unit test to ensure that the records can be deleted there. The unit tests for this feature are here: https://github.com/mozilla/treeherder/blob/master/tests/model/test_cycle_data.py#L23-L23 It doesn't look like there's one for checking that FailureLines are deleted.

If you get blocked, please let me know if there's anything I can do to help. :)

-Cam
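A minimal sketch of what such a test might look like (the fixture names test_job and failure_lines are hypothetical stand-ins for whatever tests/model/test_cycle_data.py already provides, and the fixtures are assumed to create a job old enough to be expired):

from django.core.management import call_command

from treeherder.model.models import FailureLine


def test_cycle_data_removes_expired_failure_lines(test_job, failure_lines):
    # Sanity check: the (hypothetical) fixtures created FailureLine rows
    # tied to test_job, and test_job falls outside the retention window.
    assert FailureLine.objects.filter(job_guid=test_job.guid).exists()

    # Run the cycle_data management command, which should expire the job
    # and its failure lines together.
    call_command('cycle_data')

    assert not FailureLine.objects.filter(job_guid=test_job.guid).exists()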
Flags: needinfo?(cdawson)
Bug 1347721. Here's the test to ensure FailureLines can be deleted: https://pastebin.com/x7gyZDEu. It ran successfully. What should I do next? Also, could you tell me about ElasticSearch? It is used in the cycle data process.
Flags: needinfo?(cdawson)
I read the cycle_data function in https://github.com/mozilla/treeherder/blob/master/treeherder/model/models.py#L611-L611. I didn't understand what ELASTIC_SEARCH is. Apart from that, it filters Jobs whose job_max_timestamp is beyond a certain cycle_interval, deletes the FailureLines that have a job_guid associated with those jobs, and then deletes the jobs as well.
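For anyone following along, a rough Python paraphrase of the flow described above (this is not the actual implementation; the field names submit_time/guid/job_guid and the 120-day window are assumptions):

from datetime import timedelta

from django.utils import timezone

from treeherder.model.models import FailureLine, Job

# Assumed retention window; the real cycle_interval is configurable.
cutoff = timezone.now() - timedelta(days=120)

# Jobs older than the cycle interval...
expired_jobs = Job.objects.filter(submit_time__lt=cutoff)
expired_guids = list(expired_jobs.values_list('guid', flat=True))

# ...have their FailureLines deleted by job_guid, then the jobs themselves.
FailureLine.objects.filter(job_guid__in=expired_guids).delete()
expired_jobs.delete()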
It turns out this was a poor choice for a good first bug -- my bad. This is more of a data issue that requires being able to query the DB directly in order to ascertain what is really going on.

As it happens, the FailureLine records are only deleted when their corresponding Job is deleted. But there are LOTS of records that no longer have a Job (orphans). If I query for failure lines newer than 2017-01-01, the earliest is 2017-04-05. So I believe we are deleting the newly-expiring FailureLines along with their corresponding jobs just fine; we really just need to delete the orphaned records. The JobLog table has the same issue.

That being said, I'll remove the "good first bug" tag and propose a couple of DB queries that will delete the orphaned records:

DELETE FROM failure_line fl LEFT JOIN job ON job.guid=fl.job_guid WHERE job.guid is null;

and

DELETE FROM treeherder.job_log LEFT JOIN job ON job_log.job_id=job.id WHERE job.id is null;

Ed, would you double-check me on the SQL here? :)
Flags: needinfo?(cdawson) → needinfo?(emorley)
Mentor: cdawson
Keywords: good-first-bug
Whiteboard: [lang=py]
I think the reason for the orphaned rows might be the various bugs in cycle data, some of which were fixed a while back. James, would you mind taking a look at the SQL to delete them above?
Flags: needinfo?(emorley) → needinfo?(james)
Unless there's some SQL syntax I don't know (there is), you probably mean `DELETE FROM failure_line as fl`. Apart from that it looks reasonable, but you should convert the queries into SELECTs first and check that the rows you get back look sensible, e.g.

SELECT fl.* FROM failure_line as fl LEFT JOIN job ON job.guid=fl.job_guid WHERE job.guid is null ORDER BY id DESC LIMIT 100;

(selecting the highest ids should ensure that anything you don't inspect is even older, and so also has its job deleted).

Should cleaning up such orphans be a regular part of cycle_data? It would ordinarily be a no-op, but it seems reasonable to assume that some earlier cleanup may have been incomplete, since IIRC it's not done atomically.
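If it's easier to run from a Django shell than from a MySQL prompt, the same verification step could be done through Django's DB connection (a sketch only; the query text just mirrors the SELECT above):

from django.db import connection

VERIFY_ORPHANS_SQL = """
    SELECT fl.id, fl.job_guid
    FROM failure_line AS fl
    LEFT JOIN job ON job.guid = fl.job_guid
    WHERE job.guid IS NULL
    ORDER BY fl.id DESC
    LIMIT 100
"""

with connection.cursor() as cursor:
    cursor.execute(VERIFY_ORPHANS_SQL)
    for failure_line_id, job_guid in cursor.fetchall():
        # Inspect the newest orphans; anything not listed is older still.
        print(failure_line_id, job_guid)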
Flags: needinfo?(james)
Component: Treeherder → Treeherder: Infrastructure
(In reply to Ed Morley [:emorley] from comment #0)
> The table is 86GB, 12.7 million rows.

In the 8 months since the OP, the failure_line table has now increased to 77 million rows and 320GB.
I'm going to work on this as part of fixing bug 1409679. The GraphQL queries are running slowly at times, and they spend a lot of time in this table. In order to diagnose the slowness, I'd like to start by getting this table under control.

Unfortunately, running a raw SQL query to delete the data won't quite cut it, because the ON DELETE setting for tables like failure_match is "restrict" rather than "cascade"; the cascading happens in the Django code instead of in the DB. So I think the best solution at this point is to create a management command that will do this. I'll use a chunk of raw SQL to get the job guids that are missing
Blocks: 1409679
Oops, accidental save... continued:

I'll use a chunk of raw SQL to get the job guids that are missing from the ``job`` table (in CHUNK_SIZE chunks), then do the delete on the FailureLine model for those. We can set an interval to run this (and increase the chunk size as seems appropriate). This seems like a good candidate for trying the Heroku scheduler. There are plenty of offending records on prototype that we can test with. :)
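A rough sketch of what such a management command could look like (the command name, chunk size, and raw query are illustrative assumptions, not the contents of the eventual PR):

# Hypothetical location: treeherder/model/management/commands/delete_orphaned_failure_lines.py
from django.core.management.base import BaseCommand
from django.db import connection

from treeherder.model.models import FailureLine

CHUNK_SIZE = 1000

# Raw SQL to find job guids referenced by failure_line but missing from job.
ORPHAN_GUID_SQL = """
    SELECT DISTINCT fl.job_guid
    FROM failure_line fl
    LEFT JOIN job ON job.guid = fl.job_guid
    WHERE job.guid IS NULL
    LIMIT %s
"""


class Command(BaseCommand):
    help = "Delete FailureLine rows whose job no longer exists"

    def handle(self, *args, **options):
        while True:
            with connection.cursor() as cursor:
                cursor.execute(ORPHAN_GUID_SQL, [CHUNK_SIZE])
                guids = [row[0] for row in cursor.fetchall()]

            if not guids:
                break

            # Delete via the ORM so Django cascades to dependent models
            # (e.g. failure_match), which the DB itself only RESTRICTs on.
            deleted, _ = FailureLine.objects.filter(job_guid__in=guids).delete()
            self.stdout.write("Deleted %d failure_line rows" % deleted)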
Attachment #8935958 - Flags: review?(emorley)
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Comment on attachment 8935958 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3032

Sorry for the delay reviewing this! Over the vacation I manually ran cycle_data, which helped clean up some of these and possibly changes things slightly. I've left some comments on the PR :-)
Attachment #8935958 - Flags: review?(emorley)

I'm not going to get to this anytime soon.

Assignee: cdawson → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3

After taking another look, the oldest records in this table are now 4 months old, which coincides with the cycle_data age for jobs, etc. So cycle_data is now working properly for this table. Closing.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
