Closed Bug 1182282 Opened 5 years ago Closed 5 years ago

We're storing duplicate performance data in perfherder

Categories

(Tree Management :: Perfherder, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wlach, Assigned: wlach)

References

Details

Attachments

(1 file)

PR
46 bytes, text/x-github-pull-request
camd: review+
:jmaher pointed out that there's a lot of duplicate data in here:

https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=fx-team&originalRevision=9f307a1aba5c&newProject=fx-team&newRevision=259dcb2a811e&originalSignature=829c89815f01383f9ffc0ccef6331768d9bbf6fb&newSignature=829c89815f01383f9ffc0ccef6331768d9bbf6fb

On closer inspection, it looks like we're sometimes storing multiple numbers for the same job id. After talking with :camd, it appears that we don't have strong locking in the log parser -- it's quite possible for us to parse a log multiple times if it takes a while to process (since we only abort log parsing when the job's parse status is marked as "parsed", which only happens at the very end).

We should fix our function that populates performance data to abort if data for the current job id already exists. We should also clean up the old duplicate data, as it's consuming a bunch of extra space (a quick check on one signature for the yearly data showed 200 extra datapoints).

We also may want to set an interim state of "parsing" or something to reduce the incidence of logs being parsed more than once. I don't think it's possible to completely guard against the same data being submitted more than once without a lot of extra complexity (and even that wouldn't help with things like taskcluster), but we should probably avoid doing extra work where we can.
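The guard described above can be sketched in plain Python (hypothetical names, not Treeherder's actual API): before appending a job's datapoints to a performance series, check whether data for that job id has already been stored and bail out if so.

```python
def store_performance_data(existing_job_ids, job_id, datapoints, series):
    """Append datapoints to series unless data for job_id was already stored.

    existing_job_ids is the set of job ids already present in the series;
    in the real system this check would be a database lookup.
    """
    if job_id in existing_job_ids:
        # A previous (possibly concurrent) log parse already stored data
        # for this job -- skip insertion instead of duplicating it.
        return False
    existing_job_ids.add(job_id)
    series.extend(datapoints)
    return True
```

With this in place, a second parse of the same log becomes a no-op rather than a source of duplicate datapoints.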
See Also: → 1182201
Nice spot! Thank you :-)
URL: 1179223
Blocks: 1179223
(In reply to William Lachance (:wlach) from comment #0)
> We also may want to set an interim state of "parsing" or something to reduce
> the incidence of logs being parsed more than once. I don't think it's
> possible to completely guard against the same data being submitted more than
> once without a lot of extra complexity (and even that wouldn't help with
> things like taskcluster), but we should probably avoid doing extra work
> where we can.

It looks like bug 1178240 should basically do this.
That's different afaict; we don't reparse the logs in ingestion, we just try to put them in the jobs table/objectstore and bail if we cannot.
(In reply to Ed Morley [:emorley] from comment #3)
> That's different afaict; we don't reparse the logs in ingestion, we just try
> to put them in the jobs table/objectstore and bail if we cannot.

Scheduling of the log parse happens here:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1889

And the function _load_log_urls() is called from here:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1349

And before then, any jobs that already exist in the same state should have been filtered out:
https://github.com/mozilla/treeherder/blob/0f3d1b2bcf8700070e7e99892bf1990528811bd0/treeherder/model/derived/jobs.py#L1170-L1171

The main benefit of bug 1178240 is that we won't even have to filter those jobs out (by hitting the DB), since we will have skipped them in buildapi.py to start with (saving the whole round trip to objectstore/jobs API etc).
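The filtering step described above can be illustrated with a small sketch (hypothetical names, not the actual code at the linked lines): incoming jobs that already exist in the database in the same state are dropped before any log parsing is scheduled.

```python
def filter_new_jobs(incoming_jobs, existing_states):
    """Drop jobs that already exist in the same state.

    incoming_jobs is a list of dicts with "job_guid" and "state" keys;
    existing_states maps job_guid -> last known state (a stand-in for
    the database lookup the real ingestion code performs).
    """
    return [
        job for job in incoming_jobs
        if existing_states.get(job["job_guid"]) != job["state"]
    ]
```

A job that reappears unchanged is filtered out, while a job whose state has advanced (e.g. running -> completed) still gets ingested and its log parsed.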
Attached file PR
I'm not really sure who's best to review this, so I guess I'll choose camd.

This is a pretty straightforward solution to the problem we discussed yesterday.
Attachment #8632166 - Flags: review?(cdawson)
Duplicate of this bug: 1179788
Comment on attachment 8632166 [details] [review]
PR

This looks good to me.  Would you open a bug for the issue in Comment 2 though?  I think more solid handling for duplicates would be good.
Attachment #8632166 - Flags: review?(cdawson) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/61904aaffaef2e44846ecf8c9e4f163dc149a970
Bug 1182282 - Handle submissions of duplicate performance data

It appears that on occasion we parse a log more than once, which
results in duplicate performance series going into the database.
Let's be resilient about this by not inserting duplicate datapoints into
the database (we always attach a unique job id to every datapoint, so
there's no chance of accidentally removing entries which happen to have
the same performance numbers).
Filed bug 1183246 to deal with the duplicate log parsing issue.
See Also: → 1183246
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/b8106c9e118eb2e6d49ebe6580f8a52f18985302
Bug 1182282 - Fix one more case where dupe performance data could happen

If we don't sort the elements of the series, the same element dictionary
can serialize with its keys in a different order, so identical datapoints
are not recognized as duplicates. This seems to happen in a small number
of cases.
(In reply to Treeherder Bugbot from comment #10)
> If we don't sort the elements of the series, the same element dictionary
> can serialize with its keys in a different order, so identical datapoints
> are not recognized as duplicates. This seems to happen in a small number
> of cases.

Care to explain what this means or implies?
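For illustration, the problem the commit describes can be reproduced in plain Python (this is not Treeherder's actual code): two logically identical dictionaries can serialize to different strings if their key order differs, so any string- or hash-based duplicate check treats them as distinct datapoints.

```python
import json

# Same content, different insertion order -> different serialized strings.
a = json.dumps({"job_id": 1, "value": 3.5})
b = json.dumps({"value": 3.5, "job_id": 1})
assert a != b

# Sorting the keys canonicalizes the representation, so identical
# datapoints compare equal and can be deduplicated reliably.
a_sorted = json.dumps({"job_id": 1, "value": 3.5}, sort_keys=True)
b_sorted = json.dumps({"value": 3.5, "job_id": 1}, sort_keys=True)
assert a_sorted == b_sorted
```

Sorting the series elements before serialization plays the same role as `sort_keys=True` here: it makes duplicate submissions byte-identical so the dedup check from the earlier fix can catch them.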