Closed Bug 1171707 Opened 10 years ago Closed 10 years ago

perfherder compare summary inconsistently chooses which data points to use

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: karlt, Assigned: wlach)

References

(
URL
)

Details

Attachments

(3 files)

result 1 10 years ago Karl Tomlinson (:karlt) 44.84 KB, text/html		Details
result 2 10 years ago Karl Tomlinson (:karlt) 52.76 KB, text/html		Details
PR 10 years ago William Lachance (:wlach) 46 bytes, text/x-github-pull-request	camd : review+ mdoglio : feedback+	Details \| Review

Karl Tomlinson (:karlt)

Reporter

Description

•

10 years ago

STR: load URL Expected results: Summary includes all available data points. Actual results: Summary omits many data points. I don't see any pattern to which ones are omitted and it is not consistent.

Karl Tomlinson (:karlt)

Reporter

Comment 1

•

10 years ago

Attached file result 1 — Details

Karl Tomlinson (:karlt)

Reporter

Comment 2

•

10 years ago

Attached file result 2 — Details

William Lachance (:wlach)

Assignee

Comment 3

•

10 years ago

So I'm trying to figure out what happened here. It looks like revision `5dfffedb0d3b` resolves to result set 53265 in treeherder: https://treeherder.mozilla.org/api/project/try/resultset/?revision=5dfffedb0d3b However, I'm not seeing this in the data here, even though the push_timestamp puts it within the range https://treeherder.mozilla.org/api/project/try/performance-data/0/get_performance_data/?interval_seconds=2592000&signatures=bdcd937dc5c3aa032d6a343a70082ea0c50a7337 Strangely, if you bump interval_seconds up to 7776000, I do see it: https://treeherder.mozilla.org/api/project/try/performance-data/0/get_performance_data/?interval_seconds=7776000&signatures=bdcd937dc5c3aa032d6a343a70082ea0c50a7337 None of this really explains the results Karl is posting, but my copy of compareperf consistently omits the data for 5dfffedb0d3b (but that's expected, given the missing data above). Karl, have you seen this happen more than once? I feel like you might have just accidentally run into a momentary data ingestion problem.

Flags: needinfo?(karlt)

William Lachance (:wlach)

Assignee

Updated

•

10 years ago

Assignee: nobody → wlachance

Karl Tomlinson (:karlt)

Reporter

Comment 4

•

10 years ago

I'm sure I've seen this more than once, and reasonably sure I've seen it for different revisions, because I was running comparisons with several different revisions, and the problems were frequent. I'm not seeing any non-determinism now, but it doesn't seem like an ingestion problem if the data was there and now is not. Data for many of the fab03912debe tests is also missing now, but snarkfest finds results for both revisions: http://perf.snarkfest.net/compare-talos/index.html?oldRevs=fab03912debe&newRev=5dfffedb0d3b&submit=true I don't know why interval_seconds is involved here (tests may start at various times, especially with retriggers), but, if perfherder is depending on buggy infrastructure that snarkfest is not, would you be able to move this report to the appropriate component, please?

Flags: needinfo?(karlt)

William Lachance (:wlach)

Assignee

Comment 5

•

10 years ago

(In reply to Karl Tomlinson (ni?:karlt) from comment #4) > I don't know why interval_seconds is involved here (tests may start at > various > times, especially with retriggers), but, if perfherder is depending on > buggy infrastructure that snarkfest is not, would you be able to move this > report > to the appropriate component, please? So the push_timestamp is supposed to correspond to when the revision was pushed, expiry is supposed to be based on that, not when tests were run. Given that we're seeing the data in the larger series, I don't think timing/expiry is the problem -- rather, it seems like the database update is not working properly at the treeherder/perfherder level. If that's the case, this is the right component, I'll continue investigation.

Karl Tomlinson (:karlt)

Reporter

Comment 6

•

10 years ago

Some more examples where perfherder can no longer find the data but snarkfest can, but these were all run over the same 2 or 3 days IIRC. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8e4634d4bdfa&newProject=try&newRevision=700ae3bf7429 http://perf.snarkfest.net/compare-talos/index.html?oldRevs=8e4634d4bdfa&newRev=700ae3bf7429&submit=true https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9eae3880b132&newProject=try&newRevision=44d63c38e2c1 http://perf.snarkfest.net/compare-talos/index.html?oldRevs=9eae3880b132&newRev=44d63c38e2c1&submit=true

William Lachance (:wlach)

Assignee

Comment 7

•

10 years ago

The db locking code which is supposed to prevent the populate series jobs from overwriting each other is broken: https://github.com/mozilla/treeherder/blob/f92ba96cb29230a5cd8638142abb8f7441dddc90/treeherder/model/derived/jobs.py#L2050 This ultimately calls MySQL's get_lock function, which is fine, except that get_lock always returns ok (since you're just refreshing the lock). We should be calling IS_FREE_LOCK here to validate that nothing is locked before getting it: http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_is-free-lock

William Lachance (:wlach)

Assignee

Comment 8

•

10 years ago

Attached file PR — Details

I would have liked to do a unit test, but am not currently sure how as it would be rather timing dependent. I think getting some kind of fix in here is rather important for Whistler, as I'm going to try to encourage people to use this thing and it's not good if it's broken. :)

Attachment #8624831 - Flags: review?(cdawson)

William Lachance (:wlach)

Assignee

Comment 9

•

10 years ago

Comment on attachment 8624831 [details] [review] PR Cam and I looked over this, and he convinced me to make some changes. I made the timeout configurable and also added some tests. It was a good thing I added the tests, as they revealed some more bugs in my implementation. Mauro & Cam: I think my test works, but it's possible the style could be improved. At your leisure, could you have a look and let me know?

Attachment #8624831 - Flags: feedback?(mdoglio)

Mauro Doglio [:mdoglio]

Updated

•

10 years ago

Attachment #8624831 - Flags: feedback?(mdoglio) → feedback+

William Lachance (:wlach)

Assignee

Comment 10

•

10 years ago

After discussion with :mdoglio at Whistler, I updated the unit test to make it more clear that it was a bit flaky. I think we might file a followup bug to handle the locking a bit better in the future. Mauro suggested doing the locking at the celery level, but on further thought, I wonder if a database-level lock might be better -- it's possible other things (e.g. data migration scripts) might want to modify performance series in the future.

Cameron Dawson [:camd]

Updated

•

10 years ago

Attachment #8624831 - Flags: review?(cdawson) → review+

Treeherder GitHub Bugbot

Comment 11

•

10 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/4a74b3704e14946dadd26b33407bbb0630ea83c0 Bug 1171707 - Fix locking when concurrently updating performance series Before if two celery jobs were updating the same series, one would overwrite the other because the locking code did not actually work (it just always unconditonally got a new lock without checking if anything was using it). This fixes that.

William Lachance (:wlach)

Assignee

Comment 12

•

10 years ago

I can't be 100% sure, but I strongly suspect the above commit should fix the issues seen in this bug. We can always open a new issue if this is still reproducible. We should have a build with fixed locking going to treeherder master in the next couple days.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Ed Morley [:emorley]

Updated

•

10 years ago

Depends on: 1179228

Karl Tomlinson (:karlt)

Reporter

Comment 13

•

10 years ago

Thanks, William. That sounds a likely cause of what I was seeing.

You need to log in before you can comment on or make changes to this bug.

Bugzilla

perfherder compare summary inconsistently chooses which data points to use

Categories

(Tree Management :: Perfherder, defect)

Tracking

(Not tracked)

People

(Reporter: karlt, Assigned: wlach)

References

(
URL
)

Details

Crash Data

Security

(public)

User Story

Attachments

(3 files)

Description

Comment 1

Comment 2

Comment 3

Updated

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Updated

Comment 10

Updated

Comment 11

Comment 12

Updated

Comment 13

Attachment

General

Description

File Name

Content Type