Closed Bug 1616019 Opened 4 years ago Closed 4 years ago

Old alerts with no job data created in treeherder

Categories

(Testing :: Performance, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: alexandrui, Assigned: alexandrui)

References

Details

We have a number of raptor alerts from September-December 2019 that only showed up in the last few days. The problem is that we have no data about their jobs (the jobs link no longer appears) because they are too old, so we can rule out the scenario of someone re-triggering them.
Alert 23748
Alert 23737
Alert 23733
Alert 23540
Alert 23231

Kyle, Armen, any idea why this happened?

Flags: needinfo?(klahnakoski)
Flags: needinfo?(armenzg)

Alert items are still being added to those summaries.

Perfherder is not my area of expertise.

I don't know why those mozilla-beta alerts would have shown up. Looking at the graphs, there doesn't even seem to be a regression.

Flags: needinfo?(armenzg)

Not sure I understand what you're saying, Armen. The alerts were created because Perfherder detected an improvement/regression (the alerts are marked with a vertical line). Here is the graph for Alert 23748. I'm saying this not because you don't know, but to make sure we're talking about the same thing.

Version: Version 3 → unspecified

The alerts seem to have been created not in 2020, but back around the date of the data points. I'm monitoring the alerts queue further, as I suspect their status was reset manually or by some automation script.

:igoldan

Can you point to specific records in the database that have this problem? I would imagine that perf results can have no jobs because we retain perf data for longer. But if we expect a perf record to have a job, then we should have a SQL query that returns the records that break that rule. With this query we can then make a Redash dashboard showing the problem and post it here for fixing.

Flags: needinfo?(klahnakoski)

Here is a query for the records that might show the problem:

select d.*
from performance_datum d
left join job j on j.id = d.job_id
where j.id is null
  and d.id > (
      select min(d.id)
      from performance_datum d
      where d.push_timestamp > date_add(CURDATE(), interval -60 day)
  )
limit 100

The d.id > (select min(d.id) from performance_datum d where d.push_timestamp > date_add(CURDATE(), interval -60 day)) condition is a hack to make MySQL compute min(d.id) first and only then go looking for the missing records. Otherwise the query takes too long.
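If you want to double-check that MySQL actually does that, an EXPLAIN on the same statement should show the min(d.id) subquery evaluated on its own rather than once per outer row (just a sanity check, not required):

-- Optional sanity check: the subquery should appear as its own step in the plan.
explain
select d.*
from performance_datum d
left join job j on j.id = d.job_id
where j.id is null
  and d.id > (
      select min(d.id)
      from performance_datum d
      where d.push_timestamp > date_add(CURDATE(), interval -60 day)
  )
limit 100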

:alexandrui, I've marked this as assigned to you since it's marked as a P1. If you unassign yourself from it, you will have to change the priority to P2 (or another priority based on the triage guidelines).

Assignee: nobody → aionescu
Status: NEW → ASSIGNED

The issue is that alerts from 4 months ago are missing job data, correct? We expire jobs older than 4 months, but if we were somehow prematurely expiring job data it would affect all of Treeherder, so this sounds like something very specific to Perfherder backend logic.
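For what it's worth, a quick way to compare how far back each table still goes would be something like the sketch below (submit_time is my guess at the job table's column name, not verified):

-- Rough retention comparison: oldest retained job vs. oldest perf datum.
-- (submit_time on the job table is an assumed column name.)
select
    (select min(submit_time)    from job)               as oldest_job,
    (select min(push_timestamp) from performance_datum) as oldest_perf_datum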

Since the only people who work on Perfherder are the Perf developers, it'd be preferable to pull someone in from the Treeherder team only after investigating and determining that a change we made caused the problem.

Alex, if you find any more alert summaries like these, post them here as in comment 0, but please don't change them in any way. Otherwise you'll overwrite their timestamps and it'll be hard to troubleshoot the issue.

Just leave them as they are.

Flags: needinfo?(aionescu)

I'm not really sure if this is a problem of missing jobs. I think this has more to do with why we're getting alerts on such old data points.
In general, alerts like these are generated when sheriffs trigger data points from old pushes. But we didn't see any such activity.

This feels to me like something happened with the ingestion pipeline and/or with the data from that time interval (Oct 10 up to Nov 10).
During that time, we had some very serious changes, such as meta bug 1597476 or the deployment of the new Taskcluster (forgot when that happened).

The meta bug came with changes to Perfherder's pipeline & other manual interventions over the Celery queues (if I remember correctly). As for the Taskcluster deploy, I don't know what implications it could have had.

Anyway, this doesn't seem like a trivial investigation.

Kyle, I'm not able to run that query. It takes too long and gets aborted.

Flags: needinfo?(aionescu)

:igoldan If it takes too much time, run the inner part first:

select min(d.id) from performance_datum d where d.push_timestamp>date_add(CURDATE(), interval -60 day)

then plug that number into the outer query as a literal (or just run the outer query on its own).

You can also try running it a couple of times: the data gets loaded into memory, so it runs faster the second time.
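Roughly like this; the 1000000000 below is just a placeholder for whatever the first query returns:

-- Step 1: find the cutoff id for the last ~60 days of perf data.
select min(d.id)
from performance_datum d
where d.push_timestamp > date_add(CURDATE(), interval -60 day)

-- Step 2: re-run the outer query with that id plugged in as a literal
-- (1000000000 is a placeholder, not a real cutoff).
select d.*
from performance_datum d
left join job j on j.id = d.job_id
where j.id is null
  and d.id > 1000000000
limit 100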

(In reply to Ionuț Goldan [:igoldan] from comment #10)

I'm not really sure if this is a problem of missing jobs. I think this has more to do with why we're getting alerts on such old data points.
In general, alerts like these are generated when sheriffs trigger data points from old pushes. But we didn't see any such activity.

Is it possible someone accidentally triggered alerts on these old pushes (maybe while thinking they were testing changes on stage)?

This feels to me like something happened with the ingestion pipeline and/or with the data from that time interval (Oct 10 up to Nov 10).
During that time, we had some very serious changes, such as meta bug 1597476 or the deployment of the new Taskcluster (forgot when that happened).

The meta bug came with changes to Perfherder's pipeline & other manual interventions over the Celery queues (if I remember correctly). As for the Taskcluster deploy, I don't know what implications it could have had.

That Taskcluster changeover was on Nov 9th. How would the ingestion or API changes cause pushes from that period to have alerts generated at a future date, though (and only for pushes in that window)? I'm not familiar with how all of the alert logic works, but yes, this will require a thorough investigation.

(In reply to Kyle Lahnakoski [:ekyle] from comment #6)

Here is a query for the records that might show the problem:

I lowered the limit to 10:

id          ds_job_id  result_set_id  value             push_timestamp    repository_id  signature_id  push_id
1035786415  NULL       NULL           104.131474018097  2020-01-29 05:00  4              2161426       633825
1035786416  NULL       NULL           12.154070854187   2020-01-29 05:00  4              2161038       633825
1035786417  NULL       NULL           49.4382560253143  2020-01-29 05:00  4              2161041       633825
1035786418  NULL       NULL           167.942676067352  2020-01-29 05:00  4              2161043       633825
1035786419  NULL       NULL           167.942676067352  2020-01-29 05:00  4              2161427       633825
1035786420  NULL       NULL           167.942676067352  2020-01-29 05:00  4              2161428       633825
1051149719  NULL       NULL           294.84099984169   2020-02-18 02:27  4              2163564       643891
1051149721  NULL       NULL           12.4919998645782  2020-02-18 02:27  4              2163565       643891
1051149722  NULL       NULL           319.890000104904  2020-02-18 02:27  4              2163889       643891
1051149724  NULL       NULL           629.410000085831  2020-02-18 02:27  4              2163568       643891

Maybe it tells you something.
Also, should the fact that there are about 200 records from 2020 in the performance_datum table with a NULL job_id be a reason to worry?
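For reference, the rough count can be reproduced with something like this (same schema assumptions as Kyle's query above):

-- Rough count of 2020 perf datums with no job attached.
select count(*)
from performance_datum
where job_id is null
  and push_timestamp >= '2020-01-01'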

Flags: needinfo?(klahnakoski)

In the next two weeks we will have the performance data in BigQuery, which will make it easier to see what these are. I suspect something is submitting Perfherder records to Treeherder.

Depends on: 1610347
Flags: needinfo?(klahnakoski)

While setting up the second prototype deployment, I noticed that there is a Celery task called generate_alerts which looks to be part of the ETL pipeline: https://github.com/mozilla/treeherder/blob/master/treeherder/etl/perf.py#L206. So these alerts are not just triggered manually.

(In reply to Sarah Clements [:sclements] from comment #16)

While setting up the second prototype deployment, I noticed that there is a Celery task called generate_alerts which looks to be part of the ETL pipeline: https://github.com/mozilla/treeherder/blob/master/treeherder/etl/perf.py#L206. So these alerts are not just triggered manually.

Might be a good hint. Thanks!

I will keep this bug open until we figure out what the issue is.

I had looked at a few of the alerts you posted and they had the manually_created property set to false, so I'm thinking that might be it. Perhaps there's an edge case in that code.
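To narrow it down further, a check along these lines should list the suspect summaries together with their alerts' manually_created flags (the table and column names below are my best guess at the schema, not verified):

-- Hypothetical check: do the suspect summaries contain only automatically
-- created alerts? (performance_alert_summary / performance_alert / summary_id
-- are assumed names.)
select s.id as summary_id, a.id as alert_id, a.manually_created
from performance_alert_summary s
join performance_alert a on a.summary_id = s.id
where s.id in (23748, 23737, 23733, 23540, 23231)
order by s.id, a.id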

Per what I discussed with Ionut on Friday: next time you see this issue, please don't modify the alerts, because I'll need to look at the original timestamps in order to investigate. Please post those alerts here and needinfo me.

Closing this, as nothing new has come up in a month.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME