Closed Bug 1286578 Opened 9 years ago Closed 8 years ago

Resultset is left with a push_timestamp of 0 after being updated, making it virtually hidden

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: camd)

References

Details

Attachments

(1 file)

I updated mozilla-central with commits from autoland, which resulted in https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=151aaa2db94d, but Treeherder doesn't display this push in the timeline on https://treeherder.mozilla.org/#/jobs?repo=mozilla-central, as if it never happened. https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=151aaa2db94d3fabe8958b61904371b8f415065d works, but it says I pushed Jan 1 at 1:00? Can we look into this? I guess "missing" pushes in the timeline are bad, even when hg looks OK.
Taking a look at this now.
For some reason on prod, the record for the resultset didn't get updated. On stage it did. If you look at the same push on allizom.org, it has the right data: https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=151aaa2db94d3fabe8958b61904371b8f415065d I'm investigating what went wrong on prod. It could have been a race condition that we need to handle better. I can fix this record, but there is likely an underlying cause that will continue to bite us.
I didn't see any errors in the logs for the ETL nodes that ingest pushlogs. I fear this is something that's failing silently. The issue is most likely in this code: https://github.com/mozilla/treeherder/blob/86f4bcbbd4248608fb814e12317b83ad69cca0c7/treeherder/model/derived/jobs.py#L1822-L1822 Either it's not getting the push information to update, or it doesn't think it NEEDS to update the push; I can't tell which yet. I'm going to add some instrumentation to emit New Relic events to give me what info I can here. I will likely just fix this instance in the DB; I don't think I can tell much more forensically from its current state.
Fixed in the DB now for that push. It looks like on stage we got the info first from json-pushes, but on prod we got a job with that revision before the resultset was created. So we created the "placeholder" resultset. We then back-filled it with all the revisions, but that didn't update the actual record to set the push_timestamp and author. The timestamp stayed at 0. So something is wrong with the update logic in this scenario.
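A minimal self-contained sketch of the race described above; all names (functions, fields, the in-memory store) are illustrative, not Treeherder's actual schema or API:

```python
# Sketch of the ingestion race: a job arrives before json-pushes, we create
# a placeholder resultset, and the later push ingestion fails to update it.
resultsets = {}  # revision -> {"push_timestamp": int, "author": str, "jobs": list}

def ingest_job(job):
    """A job can arrive before its push; create a placeholder resultset."""
    rs = resultsets.get(job["revision"])
    if rs is None:
        # Placeholder: no push metadata yet, push_timestamp defaults to 0.
        rs = {"push_timestamp": 0, "author": "", "jobs": []}
        resultsets[job["revision"]] = rs
    rs["jobs"].append(job["name"])

def ingest_push(push, buggy=False):
    """Later, json-pushes delivers the real push metadata."""
    rs = resultsets.setdefault(push["revision"],
                               {"push_timestamp": 0, "author": "", "jobs": []})
    if not buggy:
        # This is the update that silently didn't happen on prod: without it
        # the resultset keeps push_timestamp == 0 and sorts to Jan 1 1970.
        rs["push_timestamp"] = push["push_timestamp"]
        rs["author"] = push["author"]

# Reproduce the prod scenario: job first, then a push whose update is skipped.
ingest_job({"revision": "151aaa2db94d", "name": "build"})
ingest_push({"revision": "151aaa2db94d", "push_timestamp": 1468500000,
             "author": "cbook"}, buggy=True)
print(resultsets["151aaa2db94d"]["push_timestamp"])  # 0 -> hidden push
```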
Assignee: nobody → cdawson
Priority: -- → P1
Hrm.. I have a test case that tests this very scenario, afaict, and I can't reproduce this yet. Querying the database, I see a few instances back in May where we created the skeleton resultsets but never got any revisions for them, so they stayed skeletons. In this case, we got the revisions, but the resultset data itself didn't get updated. Querying a few repos in the DB, this is the only instance of this error happening. I'll keep looking for a cause.
For clarity: having skeletons that never get "filled in" would indicate possible rogue jobs, i.e. jobs with a revision that doesn't actually exist in that repo. This has happened before due to buildbot mistakes in builds4hr. Those were clustered in a couple-day range in May, so perhaps there was just an infra issue that got fixed and we haven't seen it since.
This happened again today on m-c. This bug needs to be fixed, for sure. But perhaps a better fallback would be to make the push_timestamp the time the record was created. Then it would at LEAST be in the ballpark of the push, rather than 0. This is tough to track down since it's so infrequent. And I have a test for this scenario, but clearly the test is insufficient in some way.
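The fallback proposed above, sketched with illustrative names (this is not Treeherder's actual placeholder code):

```python
import time

# If we must create a placeholder resultset, stamp it with the ingestion
# time rather than 0, so the push at least sorts near its real position
# instead of at Jan 1 1970.
def make_placeholder_resultset(revision):
    return {
        "revision": revision,
        "author": "",
        "push_timestamp": int(time.time()),  # ballpark; corrected when the real push arrives
    }

rs = make_placeholder_resultset("151aaa2db94d")
```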
See Also: → 1296091
Summary: merge/hg update from autoland to mozilla-central missing on treeherder → Resultset is left with a push_timestamp of 0 after being updated, making it virtually hidden
Updated the summary to depict the more general problem.
Status: NEW → ASSIGNED
Can we at least set up something, somewhere, that alerts if resultsets stay with push_timestamp == 0? Then we could at least catch all of these as they happen (or soon after).
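A sketch of what that alert could look like, using an in-memory SQLite table for illustration; the table and column names here are guesses at the shape of the data, not Treeherder's real per-repo schema:

```python
import sqlite3

# Toy stand-in for a repo database with one stuck resultset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result_set (id INTEGER PRIMARY KEY, "
             "revision TEXT, push_timestamp INTEGER)")
conn.executemany(
    "INSERT INTO result_set (revision, push_timestamp) VALUES (?, ?)",
    [("aaa", 1468500000), ("151aaa2db94d", 0), ("bbb", 1468600000)])

def stuck_resultsets(conn):
    """Return revisions whose resultset never got a real push_timestamp."""
    rows = conn.execute(
        "SELECT revision FROM result_set WHERE push_timestamp = 0").fetchall()
    return [r[0] for r in rows]

stuck = stuck_resultsets(conn)
if stuck:
    # In production this would page or emit a New Relic event, not print.
    print("ALERT: resultsets with push_timestamp == 0:", stuck)
```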
Ed had a great idea: instead of creating a skeleton so we can ingest a job as it comes in, we could just re-queue the job task to RabbitMQ until the corresponding resultset exists. I think that might be the best call going forward.
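A toy model of that idea, as I understand it; in Treeherder this would be a Celery task with retry semantics rather than the hand-rolled loop below, and all names are illustrative:

```python
from collections import deque

class MissingResultsetError(ValueError):
    """Raised when a job references a push we haven't ingested yet."""

def store_pulse_job(job, resultsets):
    # No skeleton creation: if the resultset is missing, fail and let the
    # broker re-deliver the job later.
    if job["revision"] not in resultsets:
        raise MissingResultsetError(job["revision"])
    resultsets[job["revision"]].append(job["name"])

def drain(queue, resultsets):
    """Toy broker loop: re-queue jobs whose resultset is still missing."""
    requeued = 0
    while queue:
        job = queue.popleft()
        try:
            store_pulse_job(job, resultsets)
        except MissingResultsetError:
            queue.append(job)
            requeued += 1
            if requeued > 10:  # give up eventually, like max_retries
                break
    return requeued

resultsets = {}
queue = deque([{"revision": "151aaa2db94d", "name": "test-1"}])
first = drain(queue, resultsets)        # push not ingested yet: job keeps requeueing
resultsets["151aaa2db94d"] = []         # the push arrives via json-pushes
second = drain(queue, resultsets)       # job now stores cleanly
```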
Attachment #8794380 - Flags: review?(emorley)
Comment on attachment 8794380 [details] [review] [treeherder] mozilla:requeue-jobs-when-missing-resultset > mozilla:master Made the updates we talked about. Thanks for looking again! The updates are all in a second commit that I'll squash on merge.
Attachment #8794380 - Flags: review?(emorley)
Please drop everything else, and put this in production.
Severity: normal → blocker
Flags: needinfo?(cdawson)
When it's finished the review cycle, yes.
Flags: needinfo?(cdawson)
Comment on attachment 8794380 [details] [review] [treeherder] mozilla:requeue-jobs-when-missing-resultset > mozilla:master r+ with the pytest.raises() changes :-)
Attachment #8794380 - Flags: review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/40d84ee95ddd4d8aacebd1e65536028c2ef22325

Bug 1286578 - Retry job task if resultset doesn't exist (#1872)

This removes the logic which creates "skeleton resultsets" when a job is ingested that we don't have a resultset for yet. The new approach is to fail and wait for the task to retry.

The buildbot job ingestion already skips and retries later if it encounters a job for which it has no resultset. This adds a similar check to the Pulse job ingestion. If a job comes in with a revision that doesn't have a resultset yet, this will raise a ValueError. That will invoke the retryable_task actions, which will wait a bit, then retry. Each time it will wait a little longer. After 9 retries it waits something like 3900 seconds, which should be plenty of time for the resultset ingestion to complete.
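For a feel of the backoff schedule, here is one illustrative formula (not necessarily Treeherder's exact one): doubling from a 15-second base puts the 9th retry's wait near the ~3900 seconds mentioned in the commit message.

```python
# Illustrative exponential backoff; retryable_task's real formula may differ.
def retry_delay(attempt, base=15):
    """Seconds to wait before retry number `attempt` (1-based)."""
    return base * 2 ** (attempt - 1)

schedule = [retry_delay(n) for n in range(1, 10)]
print(schedule)       # [15, 30, 60, 120, 240, 480, 960, 1920, 3840]
print(sum(schedule))  # 7665 seconds of total waiting across all 9 retries
```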
The set of steps that currently have me thinking about closing every Firefox tree until this is deployed are:

* someone pushes a broken chemspill or beta blocker patch to mozilla-release or mozilla-beta, and pastes the push URL in the bug
* a couple hours later, relman checks treeherder for that tree, sees that the tip is green and starred without noticing that it's not that push
* relman hits ShipIt with the cset from the bug, and we ship without ever noticing that failing tests we didn't see were telling us not to

The only thing I can see that would stop that would be if they always, without exception, get the cset they use to ship by copying from Treeherder.
I wonder as a follow-up, whether we should add a new queue for these "waiting for their push" jobs. It's just I got an alert on Heroku stage about queue backlog for the `store_pulse_jobs` queue after this landed. With a separate queue we could set different alert thresholds, plus more easily monitor whether any backlog was due to missing lots of pushes, or just because standard ingestion was getting behind.
This has been deployed to both stage and production now. The rabbitmq queues for SCL3 and Heroku stage are hovering at zero, so they're emptying the whole way, atm. Should we see if those seem to grow or we get more alerts? Not sure if that was a one-time thing.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Depends on: 1306578