Closed Bug 1171707 Opened 10 years ago Closed 10 years ago

perfherder compare summary inconsistently chooses which data points to use

Categories

(Tree Management :: Perfherder, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: karlt, Assigned: wlach)

References

()

Details

Attachments

(3 files)

STR: load URL Expected results: Summary includes all available data points. Actual results: Summary omits many data points. I don't see any pattern to which ones are omitted and it is not consistent.
Attached file result 1
Attached file result 2
So I'm trying to figure out what happened here. It looks like revision `5dfffedb0d3b` resolves to result set 53265 in treeherder: https://treeherder.mozilla.org/api/project/try/resultset/?revision=5dfffedb0d3b However, I'm not seeing this in the data here, even though the push_timestamp puts it within the range https://treeherder.mozilla.org/api/project/try/performance-data/0/get_performance_data/?interval_seconds=2592000&signatures=bdcd937dc5c3aa032d6a343a70082ea0c50a7337 Strangely, if you bump interval_seconds up to 7776000, I do see it: https://treeherder.mozilla.org/api/project/try/performance-data/0/get_performance_data/?interval_seconds=7776000&signatures=bdcd937dc5c3aa032d6a343a70082ea0c50a7337 None of this really explains the results Karl is posting, but my copy of compareperf consistently omits the data for 5dfffedb0d3b (but that's expected, given the missing data above). Karl, have you seen this happen more than once? I feel like you might have just accidentally run into a momentary data ingestion problem.
Flags: needinfo?(karlt)
Assignee: nobody → wlachance
I'm sure I've seen this more than once, and reasonably sure I've seen it for different revisions, because I was running comparisons with several different revisions, and the problems were frequent. I'm not seeing any non-determinism now, but it doesn't seem like an ingestion problem if the data was there and now is not. Data for many of the fab03912debe tests is also missing now, but snarkfest finds results for both revisions: http://perf.snarkfest.net/compare-talos/index.html?oldRevs=fab03912debe&newRev=5dfffedb0d3b&submit=true I don't know why interval_seconds is involved here (tests may start at various times, especially with retriggers), but, if perfherder is depending on buggy infrastructure that snarkfest is not, would you be able to move this report to the appropriate component, please?
Flags: needinfo?(karlt)
(In reply to Karl Tomlinson (ni?:karlt) from comment #4) > I don't know why interval_seconds is involved here (tests may start at > various > times, especially with retriggers), but, if perfherder is depending on > buggy infrastructure that snarkfest is not, would you be able to move this > report > to the appropriate component, please? So the push_timestamp is supposed to correspond to when the revision was pushed, expiry is supposed to be based on that, not when tests were run. Given that we're seeing the data in the larger series, I don't think timing/expiry is the problem -- rather, it seems like the database update is not working properly at the treeherder/perfherder level. If that's the case, this is the right component, I'll continue investigation.
The db locking code which is supposed to prevent the populate series jobs from overwriting each other is broken: https://github.com/mozilla/treeherder/blob/f92ba96cb29230a5cd8638142abb8f7441dddc90/treeherder/model/derived/jobs.py#L2050 This ultimately calls MySQL's get_lock function, which is fine, except that get_lock always returns ok (since you're just refreshing the lock). We should be calling IS_FREE_LOCK here to validate that nothing is locked before getting it: http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_is-free-lock
Attached file PR
I would have liked to do a unit test, but am not currently sure how as it would be rather timing dependent. I think getting some kind of fix in here is rather important for Whistler, as I'm going to try to encourage people to use this thing and it's not good if it's broken. :)
Attachment #8624831 - Flags: review?(cdawson)
Comment on attachment 8624831 [details] [review] PR Cam and I looked over this, and he convinced me to make some changes. I made the timeout configurable and also added some tests. It was a good thing I added the tests, as they revealed some more bugs in my implementation. Mauro & Cam: I think my test works, but it's possible the style could be improved. At your leisure, could you have a look and let me know?
Attachment #8624831 - Flags: feedback?(mdoglio)
Attachment #8624831 - Flags: feedback?(mdoglio) → feedback+
After discussion with :mdoglio at Whistler, I updated the unit test to make it more clear that it was a bit flaky. I think we might file a followup bug to handle the locking a bit better in the future. Mauro suggested doing the locking at the celery level, but on further thought, I wonder if a database-level lock might be better -- it's possible other things (e.g. data migration scripts) might want to modify performance series in the future.
Attachment #8624831 - Flags: review?(cdawson) → review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/4a74b3704e14946dadd26b33407bbb0630ea83c0 Bug 1171707 - Fix locking when concurrently updating performance series Before if two celery jobs were updating the same series, one would overwrite the other because the locking code did not actually work (it just always unconditonally got a new lock without checking if anything was using it). This fixes that.
I can't be 100% sure, but I strongly suspect the above commit should fix the issues seen in this bug. We can always open a new issue if this is still reproducible. We should have a build with fixed locking going to treeherder master in the next couple days.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Depends on: 1179228
Thanks, William. That sounds a likely cause of what I was seeing.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: