Old alerts with no job data created in Treeherder
Categories
(Testing :: Performance, defect, P1)
Tracking
(Not tracked)
People
(Reporter: alexandrui, Assigned: alexandrui)
References
Details
We have a number of Raptor alerts from September-December 2019 that only appeared in the last few days. The issue here is that we have no data about their jobs (the jobs link no longer appears) because they are too old, so we can rule out the scenario of someone re-triggering them.
Alert 23748
Alert 23737
Alert 23733
Alert 23540
Alert 23231
Kyle, Armen, any idea why this happened?
Comment 1•4 years ago
Alert items are still being added to those summaries.
Comment 2•4 years ago
Perfherder is not my area of expertise.
I don't know why those mozilla-beta alerts would have shown up. Looking at the graphs does not seem to indicate there's even a regression.
Comment 3•4 years ago
Not sure I understand what you're saying, Armen. The alerts were created because Perfherder detected an improvement/regression (the alerts are marked with a vertical line). Here is the graph for Alert 23748. I'm saying this not because you don't know, but to make sure we're talking about the same thing.
Comment 4•4 years ago
The alerts seem not to have been created in 2020, but rather back around the date of the data points. I'm monitoring the alerts queue further, as I suspect their status was reset manually or by some automation script.
Comment 5•4 years ago
:igoldan
Can you point to specific records in the database that have this problem? I would imagine that perf results can have no jobs because we retain perf data for longer. But if we expect every perf datum to have a job, then we should have a SQL query that returns the records breaking that rule. With that query we can then make a redash dashboard showing the problem and post it here for fixing.
Comment 6•4 years ago
Here is a query for the records that might show the problem:
select d.*
from performance_datum d
left join job j on j.id = d.job_id
where j.id is null
  and d.id > (select min(d.id)
              from performance_datum d
              where d.push_timestamp > date_add(CURDATE(), interval -60 day))
limit 100
The d.id > (select min(d.id) from performance_datum d where d.push_timestamp > date_add(CURDATE(), interval -60 day)) clause is a workaround to make MySQL compute min(d.id) first, and only then go find the missing records. Otherwise the query takes too long.
Comment 7•4 years ago
:alexandrui, I've marked this as assigned to you since it's marked as a P1. If you unassign yourself from it, you will have to change the priority to P2 (or another priority based on the triage guidelines).
Comment 8•4 years ago
The issue is that alerts from 4 months ago are missing job data, correct? We expire any jobs older than 4 months, but if we were somehow prematurely expiring job data, that would affect all of Treeherder, so this sounds like something very specific to the Perfherder backend logic.
Since the only people who work on Perfherder are the Perf developers, it'd be preferable to pull someone in from the Treeherder team only after investigating and determining that a change we made caused the problem.
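To sanity-check that theory, a query along these lines (a sketch reusing the performance_datum/job join from comment 6; the 4-month window matches the retention period mentioned above) should come back near zero if jobs are only expiring on schedule:

select count(*)
from performance_datum d
left join job j on j.id = d.job_id
-- rows young enough that their job should still be retained
where j.id is null
  and d.push_timestamp > date_add(CURDATE(), interval -4 month)

A non-zero count here would suggest job data is disappearing before the retention window expires.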
Comment 9•4 years ago
Alex, if you find any more alert summaries like these, post them in comment 0, but please don't change them in any way. Otherwise you'll overwrite their timestamps and make the issue hard to troubleshoot.
Just leave them as they are.
Comment 10•4 years ago
I'm not really sure if this is a problem of missing jobs. I think this has more to do with why we're getting alerts on such old data points.
In general, alerts like these are generated when sheriffs trigger data points from old pushes. But we didn't see any such activity.
This feels to me like something happened with the ingestion pipeline and/or with the data from that time interval (Oct 10 up to Nov 10).
During that time, we had some very serious changes, such as meta bug 1597476 or the deployment of the new Taskcluster (I forget when that happened).
The meta bug came with changes to Perfherder's pipeline & other manual interventions over the Celery queues (if I remember correctly). Regarding the Taskcluster deploy... I don't know what implications that could have brought.
Anyway, this doesn't seem like a trivial investigation.
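A rough way to surface alerts like these — assuming the performance_alert_summary and push table/column names based on Treeherder's Django models, which I haven't verified — would be:

select s.id, s.created, p.time as push_time
from performance_alert_summary s
join push p on p.id = s.push_id
-- summaries created long after the push they alert on
where s.created > date_add(p.time, interval 30 day)

Any rows returned would be summaries generated well after their data points landed.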
Comment 11•4 years ago
Kyle, I'm not able to run that query. It takes too long and gets aborted.
Comment 12•4 years ago
:igoldan If it takes too much time, run the inner part first:

select min(d.id) from performance_datum d where d.push_timestamp > date_add(CURDATE(), interval -60 day)

then plug that number into the outer query (or just run the outer query by itself).
You can also try running it a couple of times: the data will be loaded into memory, and the query will run faster the second time.
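Spelled out, the two-step version looks like this (the <min_id> below is a placeholder; paste in whatever step 1 returns):

-- step 1: find the smallest datum id in the 60-day window
select min(d.id)
from performance_datum d
where d.push_timestamp > date_add(CURDATE(), interval -60 day);

-- step 2: substitute the number from step 1 for <min_id>
select d.*
from performance_datum d
left join job j on j.id = d.job_id
where j.id is null
  and d.id > <min_id>
limit 100;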
Comment 13•4 years ago
(In reply to Ionuț Goldan [:igoldan] from comment #10)
> I'm not really sure if this is a problem of missing jobs. I think this has more to do with why we're getting alerts on such old data points.
> In general, alerts like these are generated when sheriffs trigger data points from old pushes. But we didn't see any such activity.
Is it possible someone accidentally triggered alerts on these old pushes (maybe while thinking they were testing changes on stage)?
> This feels to me like something happened with the ingestion pipeline and/or with the data from that time interval (Oct 10 up to Nov 10).
> During that time, we had some very serious changes, such as meta bug 1597476 or the deployment of the new Taskcluster (I forget when that happened). The meta bug came with changes to Perfherder's pipeline & other manual interventions over the Celery queues (if I remember correctly). Regarding the Taskcluster deploy... I don't know what implications that could have brought.
That Taskcluster changeover was on Nov 9th. How would the ingestion or API changes cause pushes from that period to have alerts generated at a future date, though (and only for pushes in that period)? I'm not familiar with how all of the alert logic works, but yes, this will require a thorough investigation.
Comment 14•4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #6)
> Here is a query for the records that might show the problem:
I lowered the limit to 10:
id          ds_job_id  result_set_id  value             push_timestamp    repository_id  signature_id  push_id
1035786415  NULL       NULL           104.131474018097  2020-01-29 05:00  4              2161426      633825
1035786416  NULL       NULL           12.154070854187   2020-01-29 05:00  4              2161038      633825
1035786417  NULL       NULL           49.4382560253143  2020-01-29 05:00  4              2161041      633825
1035786418  NULL       NULL           167.942676067352  2020-01-29 05:00  4              2161043      633825
1035786419  NULL       NULL           167.942676067352  2020-01-29 05:00  4              2161427      633825
1035786420  NULL       NULL           167.942676067352  2020-01-29 05:00  4              2161428      633825
1051149719  NULL       NULL           294.84099984169   2020-02-18 02:27  4              2163564      643891
1051149721  NULL       NULL           12.4919998645782  2020-02-18 02:27  4              2163565      643891
1051149722  NULL       NULL           319.890000104904  2020-02-18 02:27  4              2163889      643891
1051149724  NULL       NULL           629.410000085831  2020-02-18 02:27  4              2163568      643891
Maybe it tells you something.
Also, isn't the fact that there are about 200 records from 2020 in the performance_datum table with a null job_id a reason to worry?
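A count like that can be reproduced with something along these lines (a sketch; the 2020 cutoff is an assumption):

-- count this year's perf rows that were ingested without a job_id
select count(*)
from performance_datum d
where d.job_id is null
  and d.push_timestamp >= '2020-01-01'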
Comment 15•4 years ago
In the next two weeks we will have the performance data in BigQuery. It will be easier to see what these are. I suspect something is submitting Perfherder records to Treeherder.
Comment 16•4 years ago
While setting up the second prototype deployment I noticed that there is a celery task called generate_alerts, which looks to be part of the ETL pipeline: https://github.com/mozilla/treeherder/blob/master/treeherder/etl/perf.py#L206 so these alerts are not only triggered manually.
Comment 17•4 years ago
(In reply to Sarah Clements [:sclements] from comment #16)
> While setting up the second prototype deployment I noticed that there is a celery task called generate_alerts, which looks to be part of the ETL pipeline: https://github.com/mozilla/treeherder/blob/master/treeherder/etl/perf.py#L206 so these alerts are not only triggered manually.
Might be a good hint. Thanks!
I will keep this bug open until we figure out what the issue is.
Comment 18•4 years ago
I looked at a few of the alerts you posted and they had the manually_created property set to false, so I'm thinking that might be it. Perhaps there's an edge case in that code.
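For reference, a check along these lines (the performance_alert table and column names here are assumptions based on Treeherder's models, not verified) would show which of the suspect alerts were auto-generated:

-- summary ids taken from comment 0
select id, summary_id, manually_created
from performance_alert
where summary_id in (23748, 23737, 23733, 23540, 23231)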
Per what I discussed with Ionut on Friday, next time you see this issue please don't modify the alerts, because I'll need to look at the original timestamps in order to investigate. Please post those alerts here and needinfo me.
Comment 19•4 years ago
Closing this, as nothing new has come up in a month.