Closed
Bug 1182282
Opened 9 years ago
Closed 9 years ago
We're storing duplicate performance data in perfherder
Categories
(Tree Management :: Perfherder, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: wlach, Assigned: wlach)
References
Details
Attachments
(1 file)
:jmaher pointed out that there's a lot of duplicate data in here:
https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=fx-team&originalRevision=9f307a1aba5c&newProject=fx-team&newRevision=259dcb2a811e&originalSignature=829c89815f01383f9ffc0ccef6331768d9bbf6fb&newSignature=829c89815f01383f9ffc0ccef6331768d9bbf6fb

On closer inspection, it looks like we're sometimes storing multiple numbers for the same job id. On talking with :camd, it appears that we don't have strong locking in the log parser -- it's quite possible for us to parse logs multiple times if it is taking a while to process a particular log (since we only abort log parsing if the parse status of a job is marked as "parsed", which only happens at the very end).

We should fix our function that populates performance data to abort if data for the current job id already exists. We should also clean up the old duplicate data, as it's consuming a bunch of extra space (a quick check on one signature for the yearly data showed 200 extra datapoints).

We also may want to set an interim state of "parsing" or something to reduce the instance of logs being parsed more than once. I don't think it's possible to completely guard against the same data being submitted more than once without a lot of extra complexity (and even that wouldn't help with things like taskcluster), but we should probably avoid doing extra work where we can.
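The proposed fix ("abort if data for the current job id already exists") could look something like the sketch below. The function and field names here are illustrative, not Treeherder's actual API; this just shows the idempotency guard being described.

```python
# Hypothetical sketch: skip storing performance datapoints for a job id
# we've already ingested, so a log parsed twice doesn't produce duplicate
# series entries. Names are illustrative, not Treeherder's real schema.

def store_performance_data(series, job_id, datapoints):
    """Append datapoints for job_id to series, unless that job is already present.

    Returns True if data was stored, False if it was skipped as a duplicate.
    """
    existing_ids = {point["job_id"] for point in series}
    if job_id in existing_ids:
        # Data for this job was already ingested (e.g. the log was parsed
        # twice because the parse status wasn't yet marked "parsed"); abort.
        return False
    for point in datapoints:
        series.append({"job_id": job_id, **point})
    return True
```

With this guard, a second parse of the same log becomes a no-op rather than doubling the datapoints for that job.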
Comment 1•9 years ago
Nice spot! Thank you :-)
Updated•9 years ago
Assignee
Comment 2•9 years ago
(In reply to William Lachance (:wlach) from comment #0)
> We also may want to set an interim state of "parsing" or something to reduce the instance of logs being parsed more than once. I don't think it's possible to completely guard against the same data being submitted more than once without a lot of extra complexity (and even that wouldn't help with things like taskcluster), but we should probably avoid doing extra work where we can.

It looks like bug 1178240 should basically do this.
Comment 3•9 years ago
That's different afaict; we don't reparse the logs in ingestion, we just try to put them in the jobs table/objectstore and bail if we cannot.
Comment 4•9 years ago
(In reply to Ed Morley [:emorley] from comment #3)
> That's different afaict; we don't reparse the logs in ingestion, we just try to put them in the jobs table/objectstore and bail if we cannot.

Scheduling of the log parse happens here:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1889

And the function _load_log_urls() is called from here:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1349

And before then, any jobs that already exist in the same state should have been filtered out:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1170-L1171

The main benefit of bug 1178240 is that we won't even have to filter those jobs out (by hitting the DB), since we will have skipped them in buildapi.py to start with (saving the whole round trip to objectstore/jobs API etc).
Assignee
Comment 5•9 years ago
I'm not really sure who's best to review this, so I guess I'll choose camd. This is a pretty straightforward solution to the problem we discussed yesterday.
Attachment #8632166 - Flags: review?(cdawson)
Comment 7•9 years ago
Comment on attachment 8632166 [details] [review]
PR

This looks good to me. Would you open a bug for the issue in Comment 2 though? I think more solid handling for duplicates would be good.
Attachment #8632166 - Flags: review?(cdawson) → review+
Comment 8•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/61904aaffaef2e44846ecf8c9e4f163dc149a970
Bug 1182282 - Handle submissions of duplicate performance data

It appears that on occasion we parse a log more than once, which resulted in duplicate performance series going into the database. Let's be resilient about this by not inserting duplicate jobs into the database (we always attach a unique job id to every datapoint, so there's no chance of accidentally removing entries which happen to have the same performance numbers).
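The cleanup side of this (removing the old duplicates already stored) could be sketched as below. This is a hypothetical illustration of the dedupe-by-job-id idea from the commit message, not the actual migration code: keep only the first datapoint seen for each job id, since the job id uniquely identifies a datapoint even when the performance numbers happen to coincide.

```python
# Hypothetical cleanup sketch: drop duplicate datapoints from a series,
# keyed on the unique job id attached to every datapoint.

def dedupe_series(series):
    """Keep only the first datapoint for each job id, preserving order."""
    seen_job_ids = set()
    deduped = []
    for point in series:
        if point["job_id"] not in seen_job_ids:
            seen_job_ids.add(point["job_id"])
            deduped.append(point)
    return deduped
```

Keying on job id rather than on the datapoint's values is what makes it safe to remove entries that legitimately share the same performance numbers.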
Assignee
Comment 9•9 years ago
Filed bug 1183246 to deal with the duplicate log parsing issue.
See Also: → 1183246
Assignee
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 10•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/b8106c9e118eb2e6d49ebe6580f8a52f18985302
Bug 1182282 - Fix one more case where dupe performance data could happen

If we don't sort the elements of the series, the same element dictionary could have a different order. This seems to be the case in a small number of cases.
Comment 11•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/9e205becea90db447bae050c5df9aa829a906621
Bug 1182282 - Fix bustage in previous commit
Comment 12•9 years ago
(In reply to Treeherder Bugbot from comment #10)
> If we don't sort the elements of the series, the same element dictionary could have a different order. This seems to be the case in a small number of cases.

Care to explain what this means or implies?
Comment 13•9 years ago
It's covering the case mentioned by http://stackoverflow.com/questions/9427163/remove-duplicate-dict-in-list-in-python/9427216#comment11927336_9427216
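The technique discussed in that Stack Overflow comment can be sketched as follows: two dicts with the same key/value pairs compare equal regardless of key order, so an order-independent fingerprint (here `frozenset` of the items) lets you drop duplicates without being fooled by ordering. Note this simple form assumes the dict values are hashable.

```python
# Sketch of deduplicating a list of dicts where equal dicts may have been
# built with keys in a different order. Assumes values are hashable.

def dedupe_dicts(dicts):
    """Remove duplicate dicts, treating same key/value pairs as equal."""
    seen = set()
    unique = []
    for d in dicts:
        fingerprint = frozenset(d.items())  # order-independent and hashable
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(d)
    return unique
```

This is why key order matters for naive dedup schemes: comparing serialized forms (e.g. JSON strings) of two equal dicts can differ if the keys weren't sorted first, which is the small number of cases the sort fix in comment 10 addresses.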