Closed
Bug 1245080
Opened 9 years ago
Closed 8 years ago
Review the data consumption of the failureline table
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: jgraham)
References
Details
Attachments
(1 file)
The size of some of the autoclassification tables came up recently in the context of the schema migration problems. I think we should:
1) Check that we're not storing data that we shouldn't be, in the failureline table:
* Error lines for green jobs (admittedly we do perform normal parsing on green jobs to find false negatives, but maybe this is redundant with structured logs?)
* We should cap the number of error lines per job (eg at 20 errors max, if we don't already) so that exceptionally spammy jobs can't cause us issues (a sketch follows after this list)
* We should analyse false positives/lines that provide little value (eg "Return code 1") and either fix them upstream in the harness or at least blacklist them
2) Add data expiration for that table
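For the cap, a minimal sketch of what ingestion could do (the helper and the 20-line limit here are illustrative, not existing Treeherder code):

MAX_FAILURE_LINES = 20  # assumed cap, per the "eg at 20" above

def cap_failure_lines(parsed_lines, limit=MAX_FAILURE_LINES):
    """Return at most `limit` error lines, plus a flag saying whether
    the job produced more errors than we are storing."""
    truncated = len(parsed_lines) > limit
    return parsed_lines[:limit], truncated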
Assignee
Comment 1•9 years ago
I agree with your general thrust here, but in regard to the specific points:
* I think we store failure lines for any job, although I wonder if there are cases where we get a non-empty error summary and the job is still green. These files are less noisy than the general logs, so it's more likely that this case doesn't exist, or at least isn't significant.
* I believe the number of lines per job is currently capped at 30
* The "lines that provide little value" aren't in this log (typically); it just contains failed tests and CRITICAL/ERROR level log messages from the harness itself
I think the problem is more likely that we (a) have an awful lot of lines and (b) store all the error messages and crash stacks in full, duplicated for each line. If we are seeing a lot of entirely redundant data (and that needs investigation), it might be sensible to move those out into their own table so we only store unique entries, and, if needed, to consider truncating those fields.
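To make that concrete, a minimal sketch of the split using Django models; the model and field names here are illustrative rather than our actual schema:

from django.db import models

class ErrorText(models.Model):
    # One row per unique message/stack; many failure lines point at it.
    sha = models.CharField(max_length=40, unique=True)
    text = models.TextField()

class FailureLine(models.Model):
    job_guid = models.CharField(max_length=50)
    # The long, heavily duplicated fields become references rather than
    # inline TEXT columns.
    message = models.ForeignKey(ErrorText, null=True, related_name="+",
                                on_delete=models.SET_NULL)
    stack = models.ForeignKey(ErrorText, null=True, related_name="+",
                              on_delete=models.SET_NULL)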
Reporter
Comment 2•9 years ago
James/Cameron, please could one of you take ownership of this bug? :-)
It's just that we almost ran out of disk space (bug 1245079) due to the increased storage requirements of this feature, and I'm not the best person to investigate the above.
Flags: needinfo?(james)
Flags: needinfo?(cdawson)
Assignee
Comment 3•9 years ago
So, I think there should be at least one part to this solution:
1) remove failure lines after 60 days or so
and possibly a second:
2) Change the failure-line storage to de-dupe identical error messages and stack traces. It's not clear if this will be a win, but it's a measurable question.
I don't really know what the best way to implement 1) is (I assume there is already some existing framework we can use) and I don't think I have enough db access to estimate the available savings due to 2).
Flags: needinfo?(james)
Reporter
Comment 4•9 years ago
(In reply to James Graham [:jgraham] from comment #3)
> I don't really know what the best way to implement 1) is (I assume there is
> already some existing framework we can use) and I don't think I have enough
> db access to estimate the available savings due to 2).
You have access to the DB; connect using the values from:
$ heroku config | grep DATABASE
And since it's AWS, you'll need the CA certs, found in the repo in deployment/aws/
eg if using the CLI:
$ mysql -u th_admin -h treeherder-heroku.REDACTED.us-east-1.rds.amazonaws.com --ssl-ca=deployment/aws/combined-ca-bundle.pem --ssl-verify-server-cert
(and enter the password when prompted)
(These steps are in bug 1165259, and will be added to the Read the Docs documentation soon)
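For estimating the possible savings from 2), something like this from a Django shell should give a rough duplication ratio (a sketch; the field names are what failure_line is believed to contain, and note a full-table DISTINCT will be slow on a table this size):

from treeherder.model.models import FailureLine

total = FailureLine.objects.count()
unique = FailureLine.objects.values("message").distinct().count()
if total:
    print("redundant message rows: %.1f%%" % (100.0 * (total - unique) / total))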
Reporter
Comment 5•9 years ago
(In reply to James Graham [:jgraham] from comment #3)
> So, I think there should be at least one part to this solution:
> 1) remove failure lines after 60 days or so
This sounds like it will really help us - if you're happy with losing this data?
It would likely best be added to the existing cycle_data task here:
https://github.com/mozilla/treeherder/blob/7e9b174b13089638f0ddf6a10efbac2cd2cdf792/treeherder/model/derived/jobs.py#L511
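A rough sketch of what that could look like (the chunked delete and the `created` column are assumptions about the surrounding code, not a verbatim patch):

from datetime import timedelta

from django.utils import timezone

from treeherder.model.models import FailureLine

def cycle_failure_lines(days=60, chunk_size=1000):
    # Delete failure_line rows older than the cutoff, chunked so each
    # DELETE stays in a short transaction.
    cutoff = timezone.now() - timedelta(days=days)
    while True:
        ids = list(FailureLine.objects.filter(created__lt=cutoff)
                   .values_list("id", flat=True)[:chunk_size])
        if not ids:
            break
        FailureLine.objects.filter(id__in=ids).delete()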
Assignee
Comment 6•9 years ago
Yeah, it will set a horizon beyond which we won't be able to match intermittents (because matching is based on prototypical matches). It also means that if you look at old jobs they will appear weird, but neither of those seems like a deal breaker if we can't afford to store more data.
Reporter
Comment 7•9 years ago
The new RDS instance being set up for the Heroku move has 500GB provisioned, which is much more sensible. If/when we move from RDS MySQL to Aurora, it auto-scales in 10GB increments with no downtime/impact, which makes things like this less of a problem.
We should use as much data as we need, if we need it; my main concern is that we'd increased the rate of consumption significantly without realising it, or at least without running some projections :-)
Comment 8•9 years ago
So, James, it sounds like you can take this one? Let me know if you need a hand, but otherwise I'll leave it in your capable hands. :)
Flags: needinfo?(cdawson)
Comment 9•9 years ago
Assignee
Comment 10•9 years ago
Comment on attachment 8727361 [details] [review]
[treeherder] mozilla:jgraham/failure_line_cycle > mozilla:master
This seems to pass the tests at least. It ties the lifetime of the failure_line rows to that of the corresponding jobs, which gives 120 day retention at the moment, but I think it makes more sense than deleting the lines independently of the jobs.
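Concretely, the approach amounts to something like the following (an illustration of the idea, not the actual diff in the PR):

from treeherder.model.models import FailureLine

def cycle_failure_lines_for_jobs(job_guids, chunk_size=1000):
    # When a batch of jobs is cycled out, drop their failure_line rows
    # too, keyed by job_guid, so both expire together.
    for i in range(0, len(job_guids), chunk_size):
        FailureLine.objects.filter(
            job_guid__in=job_guids[i:i + chunk_size]).delete()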
Attachment #8727361 - Flags: review?(emorley)
Reporter
Updated•9 years ago
Attachment #8727361 - Flags: review?(emorley) → review+
Reporter
Updated•9 years ago
Assignee: nobody → james
Comment 11•9 years ago
Commits pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/326b918e35d65dbdca1e8984c89ff3562fc66a88
Bug 1245080 - Remove failure lines with corresponding jobs
When jobs are cycled, also remove the failure_line rows for those jobs, to limit data consumption of the failure_line table.
https://github.com/mozilla/treeherder/commit/ea9b033f48b80a7bfb3c0ed850e86254d2933d06
Merge pull request #1333 from mozilla/jgraham/failure_line_cycle
Bug 1245080 - Remove failure lines with corresponding jobs
Reporter
Updated•9 years ago
Reporter
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED