Bug 1286578
Opened 9 years ago
Closed 8 years ago
Resultset is left with a push_timestamp of 0 after being updated, making it virtually hidden
Categories: Tree Management :: Treeherder: Data Ingestion, defect, P1
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: cbook, Assigned: camd
Attachments: 1 file
I updated mozilla-central with commits from autoland and that resulted in
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=151aaa2db94d
but Treeherder doesn't display this push in the timeline on https://treeherder.mozilla.org/#/jobs?repo=mozilla-central, as if it never happened.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=151aaa2db94d3fabe8958b61904371b8f415065d works, but it says I pushed on Jan 1 at 1:00?
Can we look into this? I guess "missing" pushes in the timeline are bad, even when hg looks OK.
Assignee
Comment 1•9 years ago
Taking a look at this now.
Assignee
Comment 2•9 years ago
For some reason on prod, the record for the resultset didn't get updated. On stage it did. If you look at the same push on allizom.org, it has the right data:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=151aaa2db94d3fabe8958b61904371b8f415065d
I'm investigating what went wrong on prod. It could have been a race condition that we need to handle better. I can fix this record, but there is likely an underlying cause that will continue to bite us.
Assignee
Comment 3•9 years ago
I didn't see any errors in the logs for the pushlog ETL nodes. I fear this is something that's failing silently.
The issue is most likely in this code: https://github.com/mozilla/treeherder/blob/86f4bcbbd4248608fb814e12317b83ad69cca0c7/treeherder/model/derived/jobs.py#L1822-L1822
Either it's not getting the push information to update, or it doesn't think it NEEDS to update the push; I can't tell which yet. I'm going to add some instrumentation that emits New Relic events to give me as much information here as I can.
I will likely just fix this instance in the DB. I don't think I can tell too much more forensically from its current state.
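A minimal sketch of the kind of instrumentation described above, assuming the New Relic Python agent is available; the event name, attributes, and wrapper function are illustrative, not actual Treeherder code:

```python
import newrelic.agent

def record_resultset_update_skipped(repository, revision, reason):
    # Emit a custom event so skipped or failed push updates can be
    # queried later in New Relic.
    newrelic.agent.record_custom_event(
        "ResultsetUpdateSkipped",
        {"repository": repository, "revision": revision, "reason": reason},
    )
```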
Assignee
Comment 4•9 years ago
Fixed in the DB now for that push. It looks like on stage we got the info from json-pushes first, but on prod we got a job with that revision before the resultset had been created, so we created the "placeholder" resultset. We then back-filled it with all the revisions, but that didn't update the actual record to set the push_timestamp and author; the timestamp stayed at 0.
So something is wrong with the update logic in this scenario.
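A rough sketch of the update that appears to be getting skipped in this scenario, assuming direct SQL access via Django's default connection; the table, column, and function names are assumptions for illustration, not the actual Treeherder schema:

```python
from django.db import connection

def fill_in_placeholder_resultset(revision, push_timestamp, author):
    # When the real push data arrives, the placeholder row should be
    # updated instead of being left with its epoch-zero timestamp.
    with connection.cursor() as cursor:
        cursor.execute(
            """
            UPDATE result_set
               SET push_timestamp = %s, author = %s
             WHERE revision_hash = %s
               AND push_timestamp = 0
            """,
            [push_timestamp, author, revision],
        )
```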
Assignee
Updated•9 years ago
Assignee: nobody → cdawson
Priority: -- → P1
Assignee
Comment 5•9 years ago
Hrm.. I have a test case that covers this very scenario, afaict, and I can't reproduce this yet.
Querying the database, I see a few instances back in May where we created the skeleton resultsets but never got any revisions for them, so they stayed skeletons.
In this case we got the revisions, but the resultset data itself didn't get updated. Querying a few repos in the DB, this is the only instance of this error happening. I'll keep looking for a cause.
Assignee
Comment 6•9 years ago
For clarity: having skeletons that never get "filled in" would indicate possible rogue jobs, i.e. jobs that have a revision that doesn't actually exist in that repo. This has happened before due to buildbot mistakes in builds4hr. It looks like those were clustered in a couple-day range in May, so perhaps there was just an infra issue that got fixed and we haven't seen it again.
Assignee
Comment 7•8 years ago
This happened again today on m-c. This bug needs to be fixed, for sure. But perhaps a better fallback would be to make the push_timestamp the time the record was created; then it would at LEAST be in the ballpark of the push, rather than 0.
This is tough to track down since it's so infrequent. And I have a test for this scenario, but clearly the test is insufficient in some way.
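A minimal sketch of the fallback suggested above, with illustrative names: if no push_timestamp is known when the record is created, default to the creation time instead of 0.

```python
import time

def resolve_push_timestamp(push_timestamp):
    # Fall back to "now" so the push at least appears near its real
    # position in the timeline instead of at the epoch.
    return push_timestamp or int(time.time())
```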
Assignee
Updated•8 years ago
Summary: merge/hg update from autoland to mozilla-central missing on treeherder → Resultset is left with a push_timestamp of 0 after being updated, making it virtually hidden
Assignee
Comment 8•8 years ago
Updated the summary to depict the more general problem.
Status: NEW → ASSIGNED
Can we at least set up something, somewhere, that alerts if resultsets stay with push_timestamp == 0? Then we could at least catch all of these as they happen (or soon after).
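One possible shape for such an alert, as a sketch only: a periodic check that counts resultsets stuck at push_timestamp == 0 and reports them. The schema, logging choice, and task wiring are assumptions, not existing Treeherder code.

```python
import logging
from django.db import connection

logger = logging.getLogger(__name__)

def check_for_stuck_resultsets():
    # Intended to run periodically, e.g. from a scheduled Celery task.
    with connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM result_set WHERE push_timestamp = 0")
        (count,) = cursor.fetchone()
    if count:
        logger.error("%s resultset(s) stuck with push_timestamp == 0", count)
    return count
```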
Comment 10•8 years ago
Assignee
Comment 11•8 years ago
Ed had a great idea: instead of creating a skeleton so we can ingest a job as it comes in, we could just re-queue the job task to RabbitMQ until the corresponding resultset exists. I think that might be the best call going forward.
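A sketch of the re-queue idea, assuming Celery: rather than creating a skeleton resultset, the job-ingestion task raises and is retried until the matching resultset exists. The task name and helper functions here are hypothetical, for illustration only.

```python
from celery import shared_task

@shared_task(bind=True, max_retries=9, default_retry_delay=60)
def store_pulse_job(self, job_payload):
    # Hypothetical helper: look up the resultset for this job's revision.
    resultset = get_resultset_for_revision(job_payload["revision"])
    if resultset is None:
        # No resultset yet: put the task back on the queue and try again later.
        raise self.retry()
    # Hypothetical helper: normal ingestion once the resultset exists.
    ingest_job(resultset, job_payload)
```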
Comment 12•8 years ago
Assignee
Updated•8 years ago
Attachment #8794380 - Flags: review?(emorley)
Updated•8 years ago
Attachment #8794380 - Flags: review?(emorley)
Assignee
Comment 13•8 years ago
Comment on attachment 8794380 [details] [review]
[treeherder] mozilla:requeue-jobs-when-missing-resultset > mozilla:master
Made the updates we talked about. Thanks for looking again! The updates are all in a second commit that I'll squash on merge.
Attachment #8794380 - Flags: review?(emorley)
Updated•8 years ago
Attachment #8794380 - Flags: review?(emorley)
Comment 14•8 years ago
Please drop everything else, and put this in production.
Severity: normal → blocker
Flags: needinfo?(cdawson)
Comment 16•8 years ago
Comment on attachment 8794380 [details] [review]
[treeherder] mozilla:requeue-jobs-when-missing-resultset > mozilla:master
r+ with the pytest.raises() changes :-)
Attachment #8794380 - Flags: review+
Comment 17•8 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/40d84ee95ddd4d8aacebd1e65536028c2ef22325
Bug 1286578 - Retry job task if resultset doesn't exist (#1872)
* Bug 1286578 - Retry job task if resultset doesn't exist
This removes the logic which creates `skeleton resultsets`
when a job is ingested that we don't have a resultset for yet.
The new approach is to fail and wait for the task to retry.
The buildbot job ingestion already skips and retries later if
it encounters a job for which it has no resultset.
This adds a similar check to the Pulse Job ingestion. If
a job comes in with a revision that doesn't have a resultset
yet, then this will raise a ValueError. That will invoke the
retryable_task actions which will wait a bit, then retry. Each
time it will wait a little longer to retry. After 9 retries it
waits something like 3900 seconds which should be plenty of time
for the resultset ingestion to complete.
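An illustration of the kind of backoff schedule described in the commit message, assuming delays of roughly base * 2**retry; the 15-second base is an assumption chosen so the ninth retry waits on the order of the ~3900 seconds quoted, and this is not the actual Treeherder formula.

```python
BASE_DELAY = 15  # seconds (assumed)

for retry in range(9):
    delay = BASE_DELAY * 2 ** retry
    print(f"retry {retry + 1}: wait {delay}s before trying again")
# The ninth retry waits 15 * 2**8 = 3840 seconds, roughly the ~3900s quoted above.
```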
Comment 18•8 years ago
The steps that currently have me thinking about closing every Firefox tree until this is deployed are:
* someone pushes a broken chemspill or beta blocker patch to mozilla-release or mozilla-beta, and pastes the push URL in the bug
* a couple hours later, relman checks treeherder for that tree, sees that the tip is green and starred without noticing that it's not that push
* relman hits ShipIt with the cset from the bug, and we ship without ever noticing that failing tests we didn't see were telling us not to
The only thing I can see that would stop that would be if they always, without exception, get the cset they use to ship by copying from Treeherder.
Comment 19•8 years ago
I wonder, as a follow-up, whether we should add a new queue for these "waiting for their push" jobs. It's just that I got an alert on Heroku stage about queue backlog for the `store_pulse_jobs` queue after this landed.
With a separate queue we could set different alert thresholds, plus more easily monitor whether any backlog was due to missing lots of pushes or just because standard ingestion was getting behind.
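One way this follow-up could look, as a sketch using Celery task routing; the task and queue names are assumptions, not existing Treeherder configuration.

```python
from celery import Celery

app = Celery("treeherder")
# Route the "waiting for their push" retries to a dedicated queue so its
# backlog can be monitored and alerted on separately from normal ingestion.
app.conf.task_routes = {
    "treeherder.etl.tasks.store_pulse_jobs_missing_push": {
        "queue": "store_pulse_jobs_missing_push"
    },
}
```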
Assignee
Comment 20•8 years ago
This has been deployed to both stage and production now. The RabbitMQ queues for SCL3 and Heroku stage are hovering at zero, so they're emptying all the way, atm. Should we wait and see whether those grow or we get more alerts? Not sure if that was a one-time thing.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED