Closed Bug 1597476 Opened 4 years ago Closed 4 years ago

Optimize slow performance datum queries

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sclements, Assigned: sclements)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

Regarding the recent performance issues we've been having (bug 1570068), this query has been flagged as needing optimization: https://github.com/mozilla/treeherder/blob/master/treeherder/perf/alerts.py#L53

Per the New Relic generate_alerts transaction details: https://rpm.newrelic.com/accounts/677903/applications/14179757/datastores?tw%5Bend%5D=1574069940&tw%5Bstart%5D=1573465140

Priority: P2 → P1

There is an index called performance_datum_repository_id_795c16d99ae557e2_idx on the columns (repository_id, signature_id, push_timestamp). The database can use it to filter and sort faster if the repository is included in the filter.

Alternatively, we can replace the index with the suggested one, on just the (signature_id, push_timestamp) pair.
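As a quick illustration of why including the repository in the filter matters for that composite index, here is a minimal sketch. It uses SQLite (production is MySQL, and the table here is a simplified stand-in), but the planner behavior is analogous: filtering on both leading columns lets the index satisfy the filter and the sort together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE performance_datum (
    id INTEGER PRIMARY KEY, repository_id INTEGER,
    signature_id INTEGER, push_timestamp TEXT)""")
conn.execute("""CREATE INDEX idx_repo_sig_ts ON performance_datum
    (repository_id, signature_id, push_timestamp)""")

plan_rows = conn.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM performance_datum
    WHERE repository_id = 1 AND signature_id = 42
    ORDER BY push_timestamp""").fetchall()
plan_text = " ".join(row[-1] for row in plan_rows)
# Filtering on both leading columns lets the index drive the search,
# and the trailing push_timestamp column already yields sorted output,
# so no separate sort pass is needed.
assert "idx_repo_sig_ts" in plan_text
assert "USE TEMP B-TREE FOR ORDER BY" not in plan_text
```

If the repository filter were dropped, the leading index column could not be used for the search, and the database would fall back to a less selective access path.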

FYI

There are three processes contending for performance data in the database:

  • generate_new_alerts_in_series() - which seems to get the bulk of the blame
  • /app/treeherder/webapp/api/performance_data.py, line 208 (which may be significant, and better explains the overall increase on prod, not stage, when no code seemed to have changed)
  • perf.models.cycle_data() - which shows itself as spikes in the transaction time of the other two (I do not see the SQL being recorded, so maybe we are missing how big this is)

instead of

latest_alert_timestamp = PerformanceAlert.objects.filter(
    series_signature=signature).select_related(
    'summary__push__time').order_by(
    '-summary__push__time').values_list(
    'summary__push__time', flat=True)[:1]

use an aggregate

latest_alert_timestamp = (PerformanceAlert
    .objects
    .filter(series_signature=signature)
    .select_related('summary__push__time')
    .aggregate(Max('summary__push__time')))

assuming django is composing the correct SQL

select max(p.time)
from performance_alert a 
left join performance_alert_summary s on s.id = a.summary_id
left join push p on p.id = s.push_id
where a.series_signature_id = '<signature_id>'
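To make the equivalence concrete: both the ORDER BY ... LIMIT 1 form and the MAX() aggregate return the same latest timestamp, but the aggregate lets the database track a single running maximum instead of producing a sorted result set. A minimal sketch (SQLite standing in for MySQL, and a single push table instead of the three-way join above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE push (id INTEGER PRIMARY KEY, time TEXT)")
conn.executemany("INSERT INTO push VALUES (?, ?)",
                 [(1, "2019-11-01"), (2, "2019-11-10"), (3, "2019-11-05")])

# The sorted-and-limited form and the aggregate agree on the result.
via_sort = conn.execute(
    "SELECT time FROM push ORDER BY time DESC LIMIT 1").fetchone()[0]
via_max = conn.execute("SELECT MAX(time) FROM push").fetchone()[0]
assert via_sort == via_max == "2019-11-10"
```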

This code

    if signature_hashes:
        signature_ids = PerformanceSignature.objects.filter(
            repository=repository,
            signature_hash__in=signature_hashes).values_list('id', flat=True)
        datums = datums.filter(signature__id__in=list(signature_ids))

can be reduced to one SQL call

    if signature_hashes:
        datums = datums.filter(
            signature__signature_hash__in=signature_hashes)

assuming I know what django is doing to the SQL

wrong

Since the repository column is functionally dependent on the signature_id, we can be certain there are no negative effects on any query.

Is that true? For a given signature, is there only one repo, or can the same signature be seen across repos?

Blocks: 1597136

wrong

Is that true?

No, it seems the hash can be the same across repos. We cannot change the index; the code must be updated to include the repository.

UNIQUE KEY `performance_signature_repository_id_68b8dbcb6bd6b693_uniq` (`repository_id`,`framework_id`,`signature_hash`),
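The quoted UNIQUE KEY makes this concrete: uniqueness is enforced per (repository_id, framework_id, signature_hash), so the same hash may legitimately exist under two repositories. A small sketch of the constraint (SQLite here; the production schema is MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE performance_signature (
    id INTEGER PRIMARY KEY, repository_id INTEGER,
    framework_id INTEGER, signature_hash TEXT,
    UNIQUE (repository_id, framework_id, signature_hash))""")
# The uniqueness constraint includes repository_id, so the very same
# hash may exist under two different repositories without conflict...
conn.execute("INSERT INTO performance_signature VALUES (1, 1, 1, 'abc123')")
conn.execute("INSERT INTO performance_signature VALUES (2, 2, 1, 'abc123')")
# ...which means a hash alone cannot identify a single repository.
repos = sorted(r[0] for r in conn.execute(
    "SELECT repository_id FROM performance_signature"
    " WHERE signature_hash = 'abc123'"))
assert repos == [1, 2]
```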

(In reply to Kyle Lahnakoski [:ekyle] from comment #2)

  • generate_new_alerts_in_series() - which seems to get the bulk of the blame

Removing the order_by from here: https://github.com/mozilla/treeherder/blob/master/treeherder/perf/alerts.py#L53
had big performance implications when testing with the Django shell_plus command.

It brought execution time from 1.979306s down to 0.090839s, and the ordering doesn't appear to be necessary. Will keep digging into the other things you mentioned.
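The thread doesn't show which column that order_by sorted on, but the general effect is easy to demonstrate: when the ORDER BY cannot be satisfied by the index chosen for the filter, the database must add an extra sort pass over the matched rows. A hedged SQLite sketch (simplified table; SQLite reports the extra pass as a temporary B-tree):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE performance_datum (
    id INTEGER PRIMARY KEY, signature_id INTEGER,
    push_timestamp TEXT, value REAL)""")
conn.execute("""CREATE INDEX idx_sig_ts ON performance_datum
    (signature_id, push_timestamp)""")

def plan(sql):
    # Concatenate the detail column of the EXPLAIN QUERY PLAN output.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Ordering on a column the chosen index does not cover forces a sort pass.
with_sort = plan("""SELECT * FROM performance_datum
    WHERE signature_id = 1 AND push_timestamp >= '2019-11-07'
    ORDER BY value""")
without_sort = plan("""SELECT * FROM performance_datum
    WHERE signature_id = 1 AND push_timestamp >= '2019-11-07'""")
assert "USE TEMP B-TREE FOR ORDER BY" in with_sort
assert "USE TEMP B-TREE FOR ORDER BY" not in without_sort
```

Dropping an unnecessary ORDER BY removes that sort pass entirely, which is consistent with the measured drop in execution time.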

There is an index called performance_datum_repository_id_795c16d99ae557e2_idx on the columns repository_id, signature_id, and push_timestamp. The database can use it to filter-and-sort faster if the repo is included in the filter.

Alternatively, we can replace the index with one that is suggested; on just the (signature_id, push_timestamp) pair

The signature arg is a full queryset object, and the signature_id is already being utilized. So for this:

PerformanceDatum.objects.filter(signature=signature).filter(push_timestamp__gte=max_alert_age)

Example sql stmt:

SELECT `performance_datum`.`id`,
       `performance_datum`.`repository_id`,
       `performance_datum`.`signature_id`,
       `performance_datum`.`value`,
       `performance_datum`.`push_timestamp`,
       `performance_datum`.`job_id`,
       `performance_datum`.`push_id`,
       `performance_datum`.`ds_job_id`,
       `performance_datum`.`result_set_id`
  FROM `performance_datum`
 WHERE (`performance_datum`.`signature_id` = 1537973 AND `performance_datum`.`push_timestamp` >= '2019-11-07 20:39:42.334980')
 LIMIT 21

This gives us the 0.090839s execution time.

(In reply to Kyle Lahnakoski [:ekyle] from comment #3)

instead of

latest_alert_timestamp = PerformanceAlert.objects.filter(
    series_signature=signature).select_related(
    'summary__push__time').order_by(
    '-summary__push__time').values_list(
    'summary__push__time', flat=True)[:1]

use an aggregate

latest_alert_timestamp = (PerformanceAlert
    .objects
    .filter(series_signature=signature)
    .select_related('summary__push__time')
    .aggregate(Max('summary__push__time')))

assuming django is composing the correct SQL

select max(p.time)
from performance_alert a 
left join performance_alert_summary s on s.id = a.summary_id
left join push p on p.id = s.push_id
where a.series_signature_id = '<signature_id>'

That does work but execution time is about the same.

Summary: Optimize generate_alerts → Optimize slow performance datum queries

(In reply to Kyle Lahnakoski [:ekyle] from comment #2)

FYI

There are three processes contending for performance data in the database

  • generate_new_alerts_in_series() - which seems to get the bulk of the blame
  • /app/treeherder/webapp/api/performance_data.py, line 208 (which may be significant, and better explains the overall increase on prod, not stage, when no code seemed to have changed)
  • perf.models.cycle_data() - which shows itself as spikes in the transaction time of the other two (I do not see the SQL being recorded, so maybe we are missing how big this is)

I've combined some tweaks for the first two in a patch. I think we'll need to enable New Relic Insights to get a better idea of what's happening with perf cycle_data. I'd also like to look into making improvements to PerformanceDatum. There are notes about two deprecated columns we can probably get rid of, and I'd like to see whether we are in fact utilizing all of the indexes on the model.

(In reply to Kyle Lahnakoski [:ekyle] from comment #8)

Is that true?

No, it seems the hash can be the same across repos. We cannot change the index; the code must be updated to include the repository.

UNIQUE KEY `performance_signature_repository_id_68b8dbcb6bd6b693_uniq` (`repository_id`,`framework_id`,`signature_hash`),

There is a bug on file to deprecate use of the hashes in favor of signature_id. But when I look at that query using the Django Debug Toolbar, performance_datum is the slow part.

Regarding this block

    if signature_hashes:
        datums = datums.filter(
            signature__signature_hash__in=signature_hashes)

It's slower to query this way (and the query already filters on repository, on line 200). What we're missing, though, is the framework_id needed to utilize that index.
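The point about framework_id follows from how multi-column indexes work: only a contiguous prefix of (repository_id, framework_id, signature_hash) can drive the search, so without framework_id the signature_hash column of the index is unusable. A sketch of the effect (SQLite as a stand-in for the MySQL unique index quoted earlier; the index name here is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE performance_signature (
    id INTEGER PRIMARY KEY, repository_id INTEGER,
    framework_id INTEGER, signature_hash TEXT)""")
conn.execute("""CREATE UNIQUE INDEX sig_uniq ON performance_signature
    (repository_id, framework_id, signature_hash)""")

def plan(sql):
    # Concatenate the detail column of the EXPLAIN QUERY PLAN output.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Without framework_id, only the leading repository_id column of the
# index narrows the search; signature_hash is checked row by row.
partial = plan("""SELECT id FROM performance_signature
    WHERE repository_id = 1 AND signature_hash = 'abc123'""")
# With framework_id supplied, all three index columns narrow the search.
full = plan("""SELECT id FROM performance_signature
    WHERE repository_id = 1 AND framework_id = 2
      AND signature_hash = 'abc123'""")
assert "signature_hash=?" not in partial
assert "signature_hash=?" in full
```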

Blocks: 1599095

The PR got merged. Anything left to be addressed in here?

Thanks for the fix!

No, I think this is good.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED