Closed Bug 1668511 Opened 9 months ago Closed 8 months ago

Remove empty alert summaries

Categories

(Tree Management :: Perfherder, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: igoldan, Assigned: igoldan)

References

Details

Attachments

(3 files)

The perf data expiring algorithm can remove inactive perf signatures, which cascades to deleting perf alerts.

In turn, there are cases when alert summaries no longer have any perf alerts. This makes them useless & increases the query time without any real benefit.

Thus, we should expire perf alert summaries that are empty.

Note: normally, alert summaries & alerts should be created in a transaction. However, I don't think Perfherder used this approach. We should be careful how we delete summaries, such that we don't remove one of them while it's alerts are being created. A naive approach would be to simply let an alert summary cool down for a day or two & then remove it if it's still empty.

Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED

I've tested this on treeherder-stage & treeherder-prototype & it worked OK, without any problems. However, I've just realized the PR is a bit risky.
It's too eager to remove expired data. I mean: if the data is stale for more than an hour, it deletes it.

In the long term, if some extra buggy filter clauses are added here (in the highlighted code), we risk deleting all alert summaries older than an hour.
Perf sheriffing would run into chaos for possibly 2 weeks (it would be basically blind & clueless).

I'd rather revert this from master & increase the essential filter condition from created__lt=one_hour_ago to created__lt=one_year_ago.
It will make it much more bug tolerant & still expire stale summaries. Perf sheriffs really don't care about perf data older than 1 year.

Sarah, could you revert PR 6786 so I can make it safer?

Flags: needinfo?(sclements)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

(In reply to Ionuț Goldan [:igoldan] from comment #2)

I've tested this on treeherder-stage & treeherder-prototype & it worked OK, without any problems. However, I've just realized the PR is a bit risky.
It's too eager to remove expired data. I mean: if the data is stale for more than an hour, it deletes it.

In the long term, if some extra buggy filter clauses are added here (in the highlighted code), we risk deleting all alert summaries older than an hour.
Perf sheriffing would run into chaos for possibly 2 weeks (it would be basically blind & clueless).

I'd rather revert this from master & increase the essential filter condition from created__lt=one_hour_ago to created__lt=one_year_ago.
It will make it much more bug tolerant & still expire stale summaries. Perf sheriffs really don't care about perf data older than 1 year.

Sarah, could you revert PR 6786 so I can make it safer?

That sounds like a good call. What about alert summaries older than 6 months? I've reverted your pr and merged it to master.

Flags: needinfo?(sclements)

(In reply to Sarah Clements [:sclements] from comment #4)

[...] What about alert summaries older than 6 months? I've reverted your pr and merged it to master.

Yes, good idea. I got approval from @fstrugariu & @davehunt to expire summaries older than 6 months.

Status: REOPENED → RESOLVED
Closed: 8 months ago8 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.