Closed Bug 1245080 Opened 9 years ago Closed 8 years ago

Review the data consumption of the failureline table

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: jgraham)

References

Details

Attachments

(1 file)

The size of some of the autoclassification tables came up recently re the schema migration problems. I think we should:

1) Check that we're not storing data that we shouldn't be in the failure_line table:
* Error lines for green jobs (though admittedly we do perform normal parsing on green jobs so as to find false negatives, but maybe this is redundant with structured logs?)
* Cap the number of error lines per job (eg at 20 errors max, if we don't already) to prevent really spammy jobs from causing us issues (a rough sketch follows below)
* Analyse false positives/lines that provide little value (eg "Return code 1") and either fix them upstream in the harness or at least blacklist them

2) Add data expiration for that table
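A minimal sketch of the per-job cap idea above, assuming the cap would be applied when parsed error lines are stored; the constant and function name are hypothetical, not Treeherder's actual parser code (comment 1 below notes the real cap is currently 30, not 20):

    MAX_FAILURE_LINES = 20  # hypothetical limit; the real cap is reportedly 30

    def cap_failure_lines(parsed_lines, limit=MAX_FAILURE_LINES):
        """Keep only the first `limit` error lines so one spammy job can't flood the table."""
        return parsed_lines[:limit]

    # e.g. store only cap_failure_lines(error_summary_lines) for each job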
I agree with your general thrust here, but in regard to the specific points:
* I think we store failure lines for any job, although I wonder if there are cases where we get a non-empty error summary and the job is still green. These files are less noisy than the general logs, so it's more likely that this case doesn't exist, or at least isn't significant.
* I believe the number of lines per job is currently capped at 30.
* The "lines that provide little value" aren't (typically) in this log; it just contains failed tests and CRITICAL/ERROR level log messages from the harness itself.

I think the problem is more likely to be that we a) have an awful lot of lines and b) store all the error messages and crash stacks in full, duplicated for each line. If we are seeing a lot of entirely redundant data (and that needs investigation), it might be sensible to move those out into their own table so we only store unique entries, and, if needed, to consider truncating those fields.
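To illustrate the de-dupe idea in the last paragraph, a rough sketch (not the real Treeherder schema) of moving the repeated text into its own table so identical messages and stacks are stored once; all model and field names here are hypothetical:

    import hashlib
    from django.db import models

    class ErrorText(models.Model):
        # Hypothetical shared-text table. MySQL can't put a unique index on a
        # TEXT column, so deduplicate on a hash of the content instead.
        sha1 = models.CharField(max_length=40, unique=True)
        text = models.TextField()

        @classmethod
        def intern(cls, text):
            digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
            obj, _ = cls.objects.get_or_create(sha1=digest, defaults={'text': text})
            return obj

    class SlimFailureLine(models.Model):
        # Hypothetical slimmed-down failure line: the full message/stack text is
        # replaced by foreign keys into ErrorText, so duplicates cost one row each.
        job_guid = models.CharField(max_length=50)
        message = models.ForeignKey(ErrorText, null=True, related_name='+', on_delete=models.SET_NULL)
        stack = models.ForeignKey(ErrorText, null=True, related_name='+', on_delete=models.SET_NULL)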
James/Cameron, please could one of you take ownership of this bug? :-) It's just we almost ran out of disk space (bug 1245079) due to the increased storage requirements of this feature - and I'm not the best person to investigate the above.
Flags: needinfo?(james)
Flags: needinfo?(cdawson)
So, I think there should be at least one part to this solution:
1) Remove failure lines after 60 days or so
and possibly a second:
2) Change the failure-line storage to de-dupe identical error messages and stack traces. It's not clear if this will be a win, but it's a measurable question.

I don't really know what the best way to implement 1) is (I assume there is already some existing framework we can use), and I don't think I have enough DB access to estimate the available savings due to 2).
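As a starting point for estimating the potential savings from 2), something along these lines from a Django shell on a machine with DB access might work; FailureLine should live in treeherder.model.models, but the import path and the field names used here are assumptions:

    from django.db.models import Count
    from treeherder.model.models import FailureLine  # import path assumed

    total = FailureLine.objects.count()
    # How many distinct (message, stack) pairs occur on more than one row?
    duplicated_groups = (FailureLine.objects
                         .values('message', 'stack')   # field names are assumptions
                         .annotate(n=Count('id'))
                         .filter(n__gt=1)
                         .count())
    print('%d total rows; %d (message, stack) pairs appear more than once' % (total, duplicated_groups))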
Flags: needinfo?(james)
(In reply to James Graham [:jgraham] from comment #3)
> I don't really know what the best way to implement 1) is (I assume there is
> already some existing framework we can use) and I don't think I have enough
> db access to estimate the available savings due to 2).

You have access to the DB; connect using the values from:
$ heroku config | grep DATABASE

And since it's AWS, you'll need the CA certs, found in the repo in deployment/aws/, eg if using the CLI:
$ mysql -u th_admin -h treeherder-heroku.REDACTED.us-east-1.rds.amazonaws.com --ssl-ca=deployment/aws/combined-ca-bundle.pem --ssl-verify-server-cert
(and enter the password when prompted)

(These steps are in bug 1165259, and will be added to the Read the Docs documentation soon)
(In reply to James Graham [:jgraham] from comment #3)
> So, I think there should be at least one part to this solution:
> 1) remove failure lines after 60 days or so

This sounds like it will really help us - if you're happy with losing this data? It would likely be best added to the existing cycle_data task here:
https://github.com/mozilla/treeherder/blob/7e9b174b13089638f0ddf6a10efbac2cd2cdf792/treeherder/model/derived/jobs.py#L511
Yeah, it will set a horizon beyond which we won't be able to match intermittents (because matching is based on prototypical matches). It also means that if you look at old jobs they will appear weird, but neither of those seems like a deal breaker if we can't afford to store more data.
The new RDS instance being set up for the Heroku move has 500GB provisioned, which is much more sensible. If/when we move from RDS MySQL to Aurora, it auto-scales in 10GB increments with no downtime/impact, which makes things like this less of a problem. We should use as much storage as we need, if we need it; my main concern is that we'd increased the rate of consumption significantly without realising it, or at least without running some projections :-)
So, James, it sounds like you can take this one? Let me know if you need a hand, but otherwise I'll leave it in your capable hands. :)
Flags: needinfo?(cdawson)
Comment on attachment 8727361 [details] [review]
[treeherder] mozilla:jgraham/failure_line_cycle > mozilla:master

This seems to pass the tests at least. It ties the lifetime of the failure_line rows to that of the corresponding jobs, which gives 120 day retention at the moment, but I think it makes more sense than deleting the lines independently of the jobs.
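The approach described above is roughly the following (an approximation, not the exact patch; it assumes FailureLine has a job_guid field and that cycle_data already knows the guids of the jobs being expired):

    from treeherder.model.models import FailureLine  # import path assumed

    def cycle_failure_lines(job_guids, chunk_size=100):
        """Delete failure_line rows for jobs that are about to be cycled away."""
        for i in range(0, len(job_guids), chunk_size):
            chunk = job_guids[i:i + chunk_size]
            FailureLine.objects.filter(job_guid__in=chunk).delete()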
Attachment #8727361 - Flags: review?(emorley)
Attachment #8727361 - Flags: review?(emorley) → review+
Assignee: nobody → james
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/326b918e35d65dbdca1e8984c89ff3562fc66a88
Bug 1245080 - Remove failure lines with corresponding jobs

When jobs are cycled, also remove the failure_line rows for those jobs, to limit data consumption of the failure_line table.

https://github.com/mozilla/treeherder/commit/ea9b033f48b80a7bfb3c0ed850e86254d2933d06
Merge pull request #1333 from mozilla/jgraham/failure_line_cycle

Bug 1245080 - Remove failure lines with corresponding jobs
Blocks: 1241940
No longer blocks: autostar
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED