Optimize slow performance datum queries
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tracking
(Not tracked)
People
(Reporter: sclements, Assigned: sclements)
References
(Blocks 2 open bugs)
Details
Attachments
(2 files)
Regarding the recent performance issues we've been having (bug 1570068), this query has been flagged as needing optimization: https://github.com/mozilla/treeherder/blob/master/treeherder/perf/alerts.py#L53
Per the New Relic generate_alerts transaction details: https://rpm.newrelic.com/accounts/677903/applications/14179757/datastores?tw%5Bend%5D=1574069940&tw%5Bstart%5D=1573465140
Comment 1•4 years ago
There is an index called performance_datum_repository_id_795c16d99ae557e2_idx on the columns repository_id, signature_id, and push_timestamp. The database can use it to filter-and-sort faster if the repo is included in the filter.
Alternatively, we can replace the index with one that is suggested, on just the (signature_id, push_timestamp) pair.
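A minimal sketch of the point about the composite index, using SQLite and a stand-in index name (the real database is MySQL, and the column subset here is simplified): when repository_id is in the filter, the planner can search the index; when the leading column is missing, it falls back to a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE performance_datum (
    id INTEGER PRIMARY KEY, repository_id INTEGER,
    signature_id INTEGER, push_timestamp TEXT)""")
# Stand-in for performance_datum_repository_id_795c16d99ae557e2_idx
conn.execute("""CREATE INDEX repo_sig_time_idx ON performance_datum
    (repository_id, signature_id, push_timestamp)""")

def plan(sql):
    # Flatten the EXPLAIN QUERY PLAN detail rows into one string
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# With repository_id in the filter, the composite index can be searched
with_repo = plan("""SELECT id FROM performance_datum
    WHERE repository_id = 1 AND signature_id = 42
      AND push_timestamp >= '2019-11-07'""")

# Without it, the leading index column is unconstrained and the planner scans
without_repo = plan("""SELECT push_timestamp FROM performance_datum
    WHERE signature_id = 42 AND push_timestamp >= '2019-11-07'""")
```

The exact plan text differs between SQLite and MySQL, but the prefix rule for multi-column indexes is the same in both.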
Comment 2•4 years ago
FYI
There are three processes contending for performance data in the database:
1. generate_new_alerts_in_series() - which seems to get the bulk of the blame
2. /app/treeherder/webapp/api/performance_data.py, line 208 - which may be significant, and better explains the overall increase on prod, not stage, when no code seemed to have changed
3. perf.models.cycle_data() - which shows itself as spikes in the transaction time of the other two (I do not see the SQL being recorded, so maybe we are missing how big this is)
Comment 3•4 years ago
instead of
latest_alert_timestamp = PerformanceAlert.objects.filter(
    series_signature=signature).select_related(
    'summary__push__time').order_by(
    '-summary__push__time').values_list(
    'summary__push__time', flat=True)[:1]
use an aggregate
latest_alert_timestamp = (
    PerformanceAlert.objects
    .filter(series_signature=signature)
    .select_related('summary__push__time')
    .aggregate(Max('summary__push__time'))
)
assuming Django is composing the correct SQL:
select max(p.time)
from performance_alert a
left join performance_alert_summary s on s.id = a.summary_id
left join push p on p.id = s.push_id
where a.series_signature_id = '<signature_id>'
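As a sanity check on the equivalence (not Treeherder code; a toy SQLite schema with made-up rows), both formulations return the same latest push time:

```python
import sqlite3

# Toy versions of the three tables involved; all rows are invented
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE push (id INTEGER PRIMARY KEY, time TEXT);
CREATE TABLE performance_alert_summary (id INTEGER PRIMARY KEY, push_id INTEGER);
CREATE TABLE performance_alert (
    id INTEGER PRIMARY KEY, summary_id INTEGER, series_signature_id INTEGER);
INSERT INTO push VALUES (1, '2019-11-01'), (2, '2019-11-09');
INSERT INTO performance_alert_summary VALUES (10, 1), (11, 2);
INSERT INTO performance_alert VALUES (100, 10, 7), (101, 11, 7);
""")

# ORDER BY ... DESC LIMIT 1, mirroring the original queryset
order_by_latest = conn.execute("""
    SELECT p.time FROM performance_alert a
    LEFT JOIN performance_alert_summary s ON s.id = a.summary_id
    LEFT JOIN push p ON p.id = s.push_id
    WHERE a.series_signature_id = 7
    ORDER BY p.time DESC LIMIT 1""").fetchone()[0]

# max() aggregate, mirroring the suggested rewrite
agg_latest = conn.execute("""
    SELECT max(p.time) FROM performance_alert a
    LEFT JOIN performance_alert_summary s ON s.id = a.summary_id
    LEFT JOIN push p ON p.id = s.push_id
    WHERE a.series_signature_id = 7""").fetchone()[0]
```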
Comment 4•4 years ago
This code
if signature_hashes:
signature_ids = PerformanceSignature.objects.filter(
repository=repository,
signature_hash__in=signature_hashes).values_list('id', flat=True)
datums = datums.filter(signature__id__in=list(signature_ids))
can be reduced to one SQL call
if signature_hashes:
    datums = datums.filter(signature__signature_hash__in=signature_hashes)
assuming I know what Django is doing to the SQL
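A sketch of the reduction using toy SQLite tables and invented rows (the ORM fetches signature ids in a first query, then filters datums by that list; the single query lets the database join instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE performance_signature (
    id INTEGER PRIMARY KEY, repository_id INTEGER, signature_hash TEXT);
CREATE TABLE performance_datum (
    id INTEGER PRIMARY KEY, signature_id INTEGER, value REAL);
INSERT INTO performance_signature VALUES (1, 1, 'abc'), (2, 1, 'def'), (3, 2, 'abc');
INSERT INTO performance_datum VALUES (10, 1, 1.5), (11, 2, 2.5), (12, 3, 3.5);
""")
hashes = ("abc", "def")

# Two round trips: fetch the signature ids, then filter datums by the id list
ids = [r[0] for r in conn.execute(
    "SELECT id FROM performance_signature WHERE repository_id = 1 "
    "AND signature_hash IN (?, ?)", hashes)]
two_step = conn.execute(
    "SELECT id FROM performance_datum WHERE signature_id IN (%s)"
    % ",".join("?" * len(ids)), ids).fetchall()

# One query: join to the signature table and filter there
one_step = conn.execute(
    "SELECT d.id FROM performance_datum d "
    "JOIN performance_signature s ON s.id = d.signature_id "
    "WHERE s.repository_id = 1 AND s.signature_hash IN (?, ?)", hashes).fetchall()
```

Both approaches select the same datum rows; the joined form just avoids the extra round trip and materializing the id list in Python.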
Comment 5•4 years ago
wrong
Comment 6•4 years ago
Since the repository column is functionally dependent on the signature_id, we can be certain there are no negative effects to any query
Is that true? For a given signature, is there only one repo, or can the same signature be seen across repos?
Comment 7•4 years ago
wrong
Comment 8•4 years ago
Is that true?
No, it seems the hash can be the same across repos. We cannot change the index; the code must be updated to include the repository:
UNIQUE KEY `performance_signature_repository_id_68b8dbcb6bd6b693_uniq` (`repository_id`,`framework_id`,`signature_hash`),
Comment 9•4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #2)
generate_new_alerts_in_series()
- which seems to get the bulk of the blame
Removing the order_by
from here: https://github.com/mozilla/treeherder/blob/master/treeherder/perf/alerts.py#L53
had big performance implications when testing with the django-extensions shell_plus command.
It caused execution time to go from 1.979306s to 0.090839s and doesn't appear to be necessary. Will keep digging into the other things you mentioned.
There is an index called performance_datum_repository_id_795c16d99ae557e2_idx on the columns repository_id, signature_id, and push_timestamp. The database can use it to filter-and-sort faster if the repo is included in the filter.
Alternatively, we can replace the index with one that is suggested; on just the (signature_id, push_timestamp) pair
The signature arg is a full queryset object and the signature_id is already being utilized. So for this:
PerformanceDatum.objects.filter(signature=signature).filter(push_timestamp__gte=max_alert_age)
Example SQL statement:
SELECT `performance_datum`.`id`,
`performance_datum`.`repository_id`,
`performance_datum`.`signature_id`,
`performance_datum`.`value`,
`performance_datum`.`push_timestamp`,
`performance_datum`.`job_id`,
`performance_datum`.`push_id`,
`performance_datum`.`ds_job_id`,
`performance_datum`.`result_set_id`
FROM `performance_datum`
WHERE (`performance_datum`.`signature_id` = 1537973 AND `performance_datum`.`push_timestamp` >= '2019-11-07 20:39:42.334980')
LIMIT 21
This gives us the 0.090839s execution time.
Comment 10•4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #3)
instead of
latest_alert_timestamp = PerformanceAlert.objects.filter(
    series_signature=signature).select_related(
    'summary__push__time').order_by(
    '-summary__push__time').values_list(
    'summary__push__time', flat=True)[:1]
use an aggregate
latest_alert_timestamp = (
    PerformanceAlert.objects
    .filter(series_signature=signature)
    .select_related('summary__push__time')
    .aggregate(Max('summary__push__time'))
)
assuming django is composing the correct SQL
select max(p.time)
from performance_alert a
left join performance_alert_summary s on s.id = a.summary_id
left join push p on p.id = s.push_id
where a.series_signature_id = '<signature_id>'
That does work but execution time is about the same.
Comment 11•4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #2)
FYI
There are three processes contending for performance data in the database:
1. generate_new_alerts_in_series() - which seems to get the bulk of the blame
2. /app/treeherder/webapp/api/performance_data.py, line 208 - which may be significant, and better explains the overall increase on prod, not stage, when no code seemed to have changed
3. perf.models.cycle_data() - which shows itself as spikes in the transaction time of the other two (I do not see the SQL being recorded, so maybe we are missing how big this is)
I've combined some tweaks for the first two in a patch. I think we'll need to enable new relic insights to get a better idea of what's happening with perf cycle_data. I'd also like to look into making improvements to Performance Datum. There are notes about two deprecated columns we can probably get rid of, and I'd like to see if we are in fact utilizing all of the indexes on the model.
Comment 12•4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #8)
Is that true?
No, it seems the hash can be the same across repos. We cannot change the index; the code must be updated to include the repository:
UNIQUE KEY `performance_signature_repository_id_68b8dbcb6bd6b693_uniq` (`repository_id`,`framework_id`,`signature_hash`),
There is a bug to deprecate use of the hashes in favor of signature_id. But when I look at that query using the django debug toolbar, performance datum is the slow part.
Regarding this block
if signature_hashes:
    datums = datums.filter(signature__signature_hash__in=signature_hashes)
It's slower to query this way (and it does already filter on repository, on line 200). What we're missing is the framework_id though, to utilize that index.
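A sketch of that last point, again in SQLite with a stand-in index name: the unique key covers (repository_id, framework_id, signature_hash), so filtering on all three columns lets the whole key match, while omitting framework_id leaves only the repository_id prefix usable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE performance_signature (
    id INTEGER PRIMARY KEY, repository_id INTEGER,
    framework_id INTEGER, signature_hash TEXT)""")
# Stand-in for performance_signature_repository_id_68b8dbcb6bd6b693_uniq
conn.execute("""CREATE UNIQUE INDEX repo_framework_hash_uniq
    ON performance_signature (repository_id, framework_id, signature_hash)""")

def plan(sql):
    # Flatten the EXPLAIN QUERY PLAN detail rows into one string
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filtering on all three key columns: the full unique index can match
full_key = plan("""SELECT id FROM performance_signature
    WHERE repository_id = 1 AND framework_id = 2 AND signature_hash = 'abc'""")

# Leaving framework_id out: only the repository_id prefix is usable,
# so signature_hash cannot be matched through the index
partial_key = plan("""SELECT id FROM performance_signature
    WHERE repository_id = 1 AND signature_hash = 'abc'""")
```

MySQL applies the same leftmost-prefix rule to its multiple-column indexes, which is why adding framework_id to the filter should help here.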
Comment 13•4 years ago
Comment 14•4 years ago
Comment 15•4 years ago
The PR got merged. Anything left to be addressed in here?
Thanks for the fix!
Comment 16•4 years ago
No, I think this is good.