Closed
Bug 1182282
Opened 9 years ago
Closed 9 years ago
We're storing duplicate performance data in perfherder
Categories
(Tree Management :: Perfherder, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: wlach, Assigned: wlach)
References
Details
Attachments
(1 file)
:jmaher pointed out that there's a lot of duplicate data in here:
https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=fx-team&originalRevision=9f307a1aba5c&newProject=fx-team&newRevision=259dcb2a811e&originalSignature=829c89815f01383f9ffc0ccef6331768d9bbf6fb&newSignature=829c89815f01383f9ffc0ccef6331768d9bbf6fb

On closer inspection, it looks like we're sometimes storing multiple numbers for the same job id. On talking with :camd, it appears that we don't have strong locking in the log parser -- it's quite possible for us to parse logs multiple times if it is taking a while to process a particular log (since we only abort log parsing if the parse status of a job is marked as "parsed", which only happens at the very end).

We should fix our function that populates performance data to abort if data for the current job id already exists. We should also clean up the old duplicate data, as it's consuming a bunch of extra space (a quick check on one signature for the yearly data showed 200 extra datapoints).

We also may want to set an interim state of "parsing" or something to reduce the instance of logs being parsed more than once. I don't think it's possible to completely guard against the same data being submitted more than once without a lot of extra complexity (and even that wouldn't help with things like taskcluster), but we should probably avoid doing extra work where we can.
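The proposed fix ("abort if data for the current job id already exists") could look something like the sketch below. The function and field names here are illustrative, not Treeherder's actual API; this just shows the idempotency guard being described.

```python
# Hypothetical sketch: skip storing performance datapoints for a job id
# we've already ingested, so a log parsed twice doesn't produce duplicate
# series entries. Names are illustrative, not Treeherder's real schema.

def store_performance_data(series, job_id, datapoints):
    """Append datapoints for job_id to series, unless that job is already present.

    Returns True if data was stored, False if it was skipped as a duplicate.
    """
    existing_ids = {point["job_id"] for point in series}
    if job_id in existing_ids:
        # Data for this job was already ingested (e.g. the log was parsed
        # twice because the parse status wasn't yet marked "parsed"); abort.
        return False
    for point in datapoints:
        series.append({"job_id": job_id, **point})
    return True
```

With this guard, a second parse of the same log becomes a no-op rather than doubling the datapoints for that job.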
Comment 1•9 years ago
Nice spot! Thank you :-)
Updated•9 years ago
Assignee
Comment 2•9 years ago
(In reply to William Lachance (:wlach) from comment #0)
> We also may want to set an interim state of "parsing" or something to reduce the instance of logs being parsed more than once. I don't think it's possible to completely guard against the same data being submitted more than once without a lot of extra complexity (and even that wouldn't help with things like taskcluster), but we should probably avoid doing extra work where we can.

It looks like bug 1178240 should basically do this.
Comment 3•9 years ago
That's different afaict; we don't reparse the logs in ingestion, we just try to put them in the jobs table/objectstore and bail if we cannot.
Comment 4•9 years ago
(In reply to Ed Morley [:emorley] from comment #3)
> That's different afaict; we don't reparse the logs in ingestion, we just try to put them in the jobs table/objectstore and bail if we cannot.

Scheduling of the log parse happens here:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1889

And the function _load_log_urls() is called from here:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1349

And before then, any jobs that already exist in the same state should have been filtered out:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1170-L1171

The main benefit of bug 1178240 is that we won't even have to filter those jobs out (by hitting the DB), since we will have skipped them in buildapi.py to start with (saving the whole round trip to objectstore/jobs API etc).
Assignee
Comment 5•9 years ago
I'm not really sure who's best to review this, so I guess I'll choose camd. This is a pretty straightforward solution to the problem we discussed yesterday.
Attachment #8632166 - Flags: review?(cdawson)
Comment 7•9 years ago
Comment on attachment 8632166 [details] [review]
PR

This looks good to me. Would you open a bug for the issue in Comment 2 though? I think more solid handling for duplicates would be good.
Attachment #8632166 - Flags: review?(cdawson) → review+
Comment 8•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/61904aaffaef2e44846ecf8c9e4f163dc149a970
Bug 1182282 - Handle submissions of duplicate performance data

It appears that on occasion we parse a log more than once, which resulted in duplicate performance series going into the database. Let's be resilient about this by not inserting duplicate jobs into the database (we always attach a unique job id to every datapoint, so there's no chance of accidentally removing entries which happen to have the same performance numbers).
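The cleanup side of this (removing the old duplicates already stored) could be sketched as below. This is a hypothetical illustration of the dedupe-by-job-id idea from the commit message, not the actual migration code: keep only the first datapoint seen for each job id, since the job id uniquely identifies a datapoint even when the performance numbers happen to coincide.

```python
# Hypothetical cleanup sketch: drop duplicate datapoints from a series,
# keyed on the unique job id attached to every datapoint.

def dedupe_series(series):
    """Keep only the first datapoint for each job id, preserving order."""
    seen_job_ids = set()
    deduped = []
    for point in series:
        if point["job_id"] not in seen_job_ids:
            seen_job_ids.add(point["job_id"])
            deduped.append(point)
    return deduped
```

Keying on job id rather than on the datapoint's values is what makes it safe to remove entries that legitimately share the same performance numbers.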
Assignee
Comment 9•9 years ago
Filed bug 1183246 to deal with the duplicate log parsing issue.
See Also: → 1183246
Assignee
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 10•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/b8106c9e118eb2e6d49ebe6580f8a52f18985302
Bug 1182282 - Fix one more case where dupe performance data could happen

If we don't sort the elements of the series, the same element dictionary could have a different order. This seems to be the case in a small number of cases.
Comment 11•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/9e205becea90db447bae050c5df9aa829a906621
Bug 1182282 - Fix bustage in previous commit
Comment 12•9 years ago
(In reply to Treeherder Bugbot from comment #10)
> If we don't sort the elements of the series, the same element dictionary could have a different order. This seems to be the case in a small number of cases.

Care to explain what this means or implies?
Comment 13•9 years ago
It's covering the case mentioned by http://stackoverflow.com/questions/9427163/remove-duplicate-dict-in-list-in-python/9427216#comment11927336_9427216
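The technique discussed in that Stack Overflow comment can be sketched as follows: two dicts with the same key/value pairs compare equal regardless of key order, so an order-independent fingerprint (here `frozenset` of the items) lets you drop duplicates without being fooled by ordering. Note this simple form assumes the dict values are hashable.

```python
# Sketch of deduplicating a list of dicts where equal dicts may have been
# built with keys in a different order. Assumes values are hashable.

def dedupe_dicts(dicts):
    """Remove duplicate dicts, treating same key/value pairs as equal."""
    seen = set()
    unique = []
    for d in dicts:
        fingerprint = frozenset(d.items())  # order-independent and hashable
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(d)
    return unique
```

This is why key order matters for naive dedup schemes: comparing serialized forms (e.g. JSON strings) of two equal dicts can differ if the keys weren't sorted first, which is the small number of cases the sort fix in comment 10 addresses.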