Closed
Bug 1171707
Opened 10 years ago
Closed 10 years ago
perfherder compare summary inconsistently chooses which data points to use
Categories
(Tree Management :: Perfherder, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: karlt, Assigned: wlach)
Details
Attachments
(3 files)
STR: load URL
Expected results:
Summary includes all available data points.
Actual results:
Summary omits many data points.
I don't see any pattern in which data points are omitted, and the set of omitted points is not consistent between loads.
Reporter
Comment 1•10 years ago
Reporter
Comment 2•10 years ago
Assignee
Comment 3•10 years ago
So I'm trying to figure out what happened here. It looks like revision `5dfffedb0d3b` resolves to result set 53265 in treeherder:
https://treeherder.mozilla.org/api/project/try/resultset/?revision=5dfffedb0d3b
However, I'm not seeing that result set in the data here, even though its push_timestamp puts it within the requested range:
https://treeherder.mozilla.org/api/project/try/performance-data/0/get_performance_data/?interval_seconds=2592000&signatures=bdcd937dc5c3aa032d6a343a70082ea0c50a7337
Strangely, if you bump interval_seconds up to 7776000, I do see it:
https://treeherder.mozilla.org/api/project/try/performance-data/0/get_performance_data/?interval_seconds=7776000&signatures=bdcd937dc5c3aa032d6a343a70082ea0c50a7337
None of this really explains the results Karl is posting, but my copy of compareperf consistently omits the data for 5dfffedb0d3b (but that's expected, given the missing data above). Karl, have you seen this happen more than once? I feel like you might have just accidentally run into a momentary data ingestion problem.
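For reference, a quick way to probe this is a short script along the lines below. It is a rough sketch only: it walks the returned JSON generically rather than assuming the exact payload layout of the get_performance_data endpoint, and simply reports whether the result set id appears anywhere in each response.

```python
# Minimal probe sketch: query the signature above at both intervals and report
# whether result set 53265 appears anywhere in the response. The JSON is
# walked generically because the exact payload layout is not assumed here.
import requests

BASE = ("https://treeherder.mozilla.org/api/project/try/"
        "performance-data/0/get_performance_data/")
SIGNATURE = "bdcd937dc5c3aa032d6a343a70082ea0c50a7337"
TARGET_RESULT_SET = 53265

def contains_value(obj, value):
    if isinstance(obj, dict):
        return any(contains_value(v, value) for v in obj.values())
    if isinstance(obj, list):
        return any(contains_value(v, value) for v in obj)
    return obj == value

for interval in (2592000, 7776000):  # 30 days vs. 90 days
    resp = requests.get(BASE, params={"interval_seconds": interval,
                                      "signatures": SIGNATURE})
    resp.raise_for_status()
    print(interval, contains_value(resp.json(), TARGET_RESULT_SET))
```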
Flags: needinfo?(karlt)
Assignee
Updated•10 years ago
Assignee: nobody → wlachance
Reporter
Comment 4•10 years ago
I'm sure I've seen this more than once, and reasonably sure I've seen it for different revisions, because I was running comparisons with several different revisions, and the problems were frequent.
I'm not seeing any non-determinism now, but it doesn't seem like an ingestion problem if the data was there and now is not.
Data for many of the fab03912debe tests is also missing now, but snarkfest finds results for both revisions:
http://perf.snarkfest.net/compare-talos/index.html?oldRevs=fab03912debe&newRev=5dfffedb0d3b&submit=true
I don't know why interval_seconds is involved here (tests may start at various times, especially with retriggers), but, if perfherder is depending on buggy infrastructure that snarkfest is not, would you be able to move this report to the appropriate component, please?
Flags: needinfo?(karlt)
Assignee
Comment 5•10 years ago
(In reply to Karl Tomlinson (ni?:karlt) from comment #4)
> I don't know why interval_seconds is involved here (tests may start at
> various times, especially with retriggers), but, if perfherder is depending
> on buggy infrastructure that snarkfest is not, would you be able to move
> this report to the appropriate component, please?
So push_timestamp is supposed to correspond to when the revision was pushed, and expiry is supposed to be based on that, not on when the tests were run. Given that we're seeing the data in the larger series, I don't think timing/expiry is the problem; rather, it seems like the database update is not working properly at the treeherder/perfherder level. If that's the case, this is the right component, and I'll continue investigating.
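Put another way, whether a data point falls inside the requested window should depend only on its push_timestamp, roughly as in the sketch below (an illustration of the intended semantics, not actual treeherder code):

```python
# Illustration of the intended semantics only (not treeherder code): a data
# point belongs to an interval_seconds window based on when its revision was
# pushed, regardless of when the test jobs (or retriggers) actually ran.
import time

def in_window(point, interval_seconds, now=None):
    now = time.time() if now is None else now
    return point["push_timestamp"] >= now - interval_seconds
```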
Reporter
Comment 6•10 years ago
Some more examples where perfherder can no longer find the data but snarkfest can; these were all run over the same 2 or 3 days, IIRC.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8e4634d4bdfa&newProject=try&newRevision=700ae3bf7429
http://perf.snarkfest.net/compare-talos/index.html?oldRevs=8e4634d4bdfa&newRev=700ae3bf7429&submit=true
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9eae3880b132&newProject=try&newRevision=44d63c38e2c1
http://perf.snarkfest.net/compare-talos/index.html?oldRevs=9eae3880b132&newRev=44d63c38e2c1&submit=true
Assignee
Comment 7•10 years ago
The db locking code which is supposed to prevent the populate series jobs from overwriting each other is broken:
https://github.com/mozilla/treeherder/blob/f92ba96cb29230a5cd8638142abb8f7441dddc90/treeherder/model/derived/jobs.py#L2050
This ultimately calls MySQL's GET_LOCK function, which is fine, except that GET_LOCK always returns ok here (since you're just refreshing the lock). We should be calling IS_FREE_LOCK first to validate that nothing holds the lock before getting it:
http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_is-free-lock
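Sketched roughly, the proposed check looks something like the helper below (a hypothetical illustration of the pattern, not the actual treeherder code or the eventual fix):

```python
# Rough sketch of the proposed pattern (hypothetical helper, not treeherder
# code): only take the named lock if nothing else currently holds it, instead
# of calling GET_LOCK unconditionally and treating its return value as success.
def get_lock_if_free(cursor, lock_name, timeout_seconds=60):
    # IS_FREE_LOCK returns 1 if the name is unused, 0 if held, NULL on error.
    cursor.execute("SELECT IS_FREE_LOCK(%s)", [lock_name])
    (is_free,) = cursor.fetchone()
    if not is_free:
        return False
    # GET_LOCK returns 1 on success, 0 on timeout, NULL on error.
    cursor.execute("SELECT GET_LOCK(%s, %s)", [lock_name, timeout_seconds])
    (acquired,) = cursor.fetchone()
    return acquired == 1
```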
Assignee
Comment 8•10 years ago
I would have liked to do a unit test, but I'm not currently sure how, as it would be rather timing-dependent. I think getting some kind of fix in here is rather important for Whistler, as I'm going to try to encourage people to use this thing, and it's not good if it's broken. :)
Attachment #8624831 - Flags: review?(cdawson)
Assignee
Comment 9•10 years ago
Comment on attachment 8624831 [details] [review]
PR
Cam and I looked over this, and he convinced me to make some changes. I made the timeout configurable and also added some tests. It was a good thing I added the tests, as they revealed some more bugs in my implementation.
Mauro & Cam: I think my test works, but it's possible the style could be improved. At your leisure, could you have a look and let me know?
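For reference, a timing-dependent test of this kind tends to take roughly the shape below (a hypothetical sketch, not the test added in the PR). It assumes a configured Django environment, where each thread gets its own MySQL connection, so the two workers race from separate sessions; a real test would drive the series-update code path rather than GET_LOCK directly.

```python
# Hypothetical shape of a timing-dependent lock test (not the PR's test):
# two threads race for the same named MySQL lock from separate sessions,
# and exactly one zero-timeout GET_LOCK call should win.
import threading
from django.db import connections

results = []

def try_lock():
    with connections["default"].cursor() as cursor:
        cursor.execute("SELECT GET_LOCK(%s, 0)", ["perf_series_test_lock"])
        (acquired,) = cursor.fetchone()
        results.append(acquired == 1)

threads = [threading.Thread(target=try_lock) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(results) == [False, True]
```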
Attachment #8624831 - Flags: feedback?(mdoglio)
Updated•10 years ago
Attachment #8624831 - Flags: feedback?(mdoglio) → feedback+
Assignee
Comment 10•10 years ago
After discussion with :mdoglio at Whistler, I updated the unit test to make it clearer that it is a bit flaky. I think we might file a followup bug to handle the locking a bit better in the future. Mauro suggested doing the locking at the celery level, but on further thought I wonder if a database-level lock might be better -- it's possible other things (e.g. data migration scripts) might want to modify performance series in the future.
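As a sketch of the database-level option (hypothetical, not code from the PR), the idea would be a small context manager around GET_LOCK/RELEASE_LOCK that any writer of performance series, whether a celery task or a future migration script, could share:

```python
# Hypothetical database-level lock (not code from the PR): serialize all
# writers of a performance series on a named MySQL lock, releasing it even
# if the wrapped update raises.
from contextlib import contextmanager

@contextmanager
def performance_series_lock(cursor, name, timeout_seconds=60):
    cursor.execute("SELECT GET_LOCK(%s, %s)", [name, timeout_seconds])
    (acquired,) = cursor.fetchone()
    if acquired != 1:
        raise RuntimeError("could not acquire lock %r" % name)
    try:
        yield
    finally:
        cursor.execute("SELECT RELEASE_LOCK(%s)", [name])
        cursor.fetchone()
```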
Updated•10 years ago
Attachment #8624831 - Flags: review?(cdawson) → review+
Comment 11•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/4a74b3704e14946dadd26b33407bbb0630ea83c0
Bug 1171707 - Fix locking when concurrently updating performance series
Before, if two celery jobs were updating the same series, one would overwrite
the other, because the locking code did not actually work (it just always
unconditionally got a new lock without checking whether anything was using it).
This fixes that.
Assignee
Comment 12•10 years ago
I can't be 100% sure, but I strongly suspect the above commit should fix the issues seen in this bug. We can always open a new issue if this is still reproducible.
We should have a build with fixed locking going to treeherder master in the next couple days.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Reporter
Comment 13•10 years ago
Thanks, William. That sounds like a likely cause of what I was seeing.