Closed Bug 1286578 Opened 9 years ago Closed 8 years ago

Resultset is left with a push_timestamp of 0 after being updated, making it virtually hidden

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: camd)

References

Details

Attachments

(1 file)

I updated mozilla-central with commits from autoland, which resulted in https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=151aaa2db94d, but Treeherder doesn't display this push in the timeline on https://treeherder.mozilla.org/#/jobs?repo=mozilla-central, as if it never happened. https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=151aaa2db94d3fabe8958b61904371b8f415065d works, but it says I pushed Jan 1 at 1:00? Can we look into this? I guess "missing" pushes in the timeline are bad, even when hg looks OK.
Taking a look at this now.
For some reason on prod, the record for the resultset didn't get updated. On stage it did. If you look at the same push on allizom.org, it has the right data: https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=151aaa2db94d3fabe8958b61904371b8f415065d I'm investigating what went wrong on prod. It could have been a race condition that we need to handle better. I can fix this record, but there is likely an underlying cause that will continue to bite us.
I didn't see any errors in the logs for the ETL nodes that ingest pushlogs. I fear this is something that's failing silently. The issue is most likely in this code: https://github.com/mozilla/treeherder/blob/86f4bcbbd4248608fb814e12317b83ad69cca0c7/treeherder/model/derived/jobs.py#L1822-L1822 Either it's not getting the push information to update, or it doesn't think it NEEDS to update the push; I can't tell which yet. I'm going to add some instrumentation to emit New Relic events to give me what info I can here. I will likely just fix this instance in the DB; I don't think I can tell much more forensically from its current state.
Fixed in the DB now for that push. It looks like on stage we got the info first from json-pushes, but on prod we got a job with that revision before the resultset was created. So we created the "placeholder" resultset. We then back-filled it with all the revisions, but that didn't update the actual record to set the push_timestamp and author. The timestamp stayed at 0. So something is wrong with the update logic in this scenario.
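A minimal self-contained sketch of the race described above; all names (functions, fields, the in-memory store) are illustrative, not Treeherder's actual schema or API:

```python
# Sketch of the ingestion race: a job arrives before json-pushes, we create
# a placeholder resultset, and the later push ingestion fails to update it.
resultsets = {}  # revision -> {"push_timestamp": int, "author": str, "jobs": list}

def ingest_job(job):
    """A job can arrive before its push; create a placeholder resultset."""
    rs = resultsets.get(job["revision"])
    if rs is None:
        # Placeholder: no push metadata yet, push_timestamp defaults to 0.
        rs = {"push_timestamp": 0, "author": "", "jobs": []}
        resultsets[job["revision"]] = rs
    rs["jobs"].append(job["name"])

def ingest_push(push, buggy=False):
    """Later, json-pushes delivers the real push metadata."""
    rs = resultsets.setdefault(push["revision"],
                               {"push_timestamp": 0, "author": "", "jobs": []})
    if not buggy:
        # This is the update that silently didn't happen on prod: without it
        # the resultset keeps push_timestamp == 0 and sorts to Jan 1 1970.
        rs["push_timestamp"] = push["push_timestamp"]
        rs["author"] = push["author"]

# Reproduce the prod scenario: job first, then a push whose update is skipped.
ingest_job({"revision": "151aaa2db94d", "name": "build"})
ingest_push({"revision": "151aaa2db94d", "push_timestamp": 1468500000,
             "author": "cbook"}, buggy=True)
print(resultsets["151aaa2db94d"]["push_timestamp"])  # 0 -> hidden push
```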
Assignee: nobody → cdawson
Priority: -- → P1
Hrm.. I have a test case that tests this very scenario, afaict, and I can't reproduce this yet. Querying the database, I see a few instances back in May where we created the skeleton resultsets but never got any revisions for them, so they stayed skeletons. In this case, we got the revisions, but the resultset data itself didn't get updated. Querying a few repos in the DB, this is the only instance of this error happening. I'll keep looking for a cause.
For clarity: having skeletons that never get "filled in" would indicate possible rogue jobs, i.e. jobs with a revision that doesn't actually exist in that repo. This has happened before due to buildbot mistakes in builds4hr. Those were clustered in a couple-day range in May, so perhaps there was just an infra issue that got fixed and we haven't seen it since.
This happened again today on m-c. This bug needs to be fixed, for sure. But perhaps a better fallback would be to make the push_timestamp the time the record was created. Then it would at LEAST be in the ballpark of the push, rather than 0. This is tough to track down since it's so infrequent. And I have a test for this scenario, but clearly the test is insufficient in some way.
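The fallback proposed above, sketched with illustrative names (this is not Treeherder's actual placeholder code):

```python
import time

# If we must create a placeholder resultset, stamp it with the ingestion
# time rather than 0, so the push at least sorts near its real position
# instead of at Jan 1 1970.
def make_placeholder_resultset(revision):
    return {
        "revision": revision,
        "author": "",
        "push_timestamp": int(time.time()),  # ballpark; corrected when the real push arrives
    }

rs = make_placeholder_resultset("151aaa2db94d")
```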
See Also: → 1296091
Summary: merge/hg update from autoland to mozilla-central missing on treeherder → Resultset is left with a push_timestamp of 0 after being updated, making it virtually hidden
Updated the summary to depict the more general problem.
Status: NEW → ASSIGNED
Can we at least set up something, somewhere, that alerts if resultsets stay with push_timestamp == 0? Then we could at least catch all of these as they happen (or soon after).
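A sketch of what that alert could look like, using an in-memory SQLite table for illustration; the table and column names here are guesses at the shape of the data, not Treeherder's real per-repo schema:

```python
import sqlite3

# Toy stand-in for a repo database with one stuck resultset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result_set (id INTEGER PRIMARY KEY, "
             "revision TEXT, push_timestamp INTEGER)")
conn.executemany(
    "INSERT INTO result_set (revision, push_timestamp) VALUES (?, ?)",
    [("aaa", 1468500000), ("151aaa2db94d", 0), ("bbb", 1468600000)])

def stuck_resultsets(conn):
    """Return revisions whose resultset never got a real push_timestamp."""
    rows = conn.execute(
        "SELECT revision FROM result_set WHERE push_timestamp = 0").fetchall()
    return [r[0] for r in rows]

stuck = stuck_resultsets(conn)
if stuck:
    # In production this would page or emit a New Relic event, not print.
    print("ALERT: resultsets with push_timestamp == 0:", stuck)
```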
Ed had a great idea: instead of creating a skeleton so we can ingest a job as it comes in, we could just re-queue the job task to RabbitMQ until the corresponding resultset exists. I think that might be the best call going forward.
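A toy model of that idea, as I understand it; in Treeherder this would be a Celery task with retry semantics rather than the hand-rolled loop below, and all names are illustrative:

```python
from collections import deque

class MissingResultsetError(ValueError):
    """Raised when a job references a push we haven't ingested yet."""

def store_pulse_job(job, resultsets):
    # No skeleton creation: if the resultset is missing, fail and let the
    # broker re-deliver the job later.
    if job["revision"] not in resultsets:
        raise MissingResultsetError(job["revision"])
    resultsets[job["revision"]].append(job["name"])

def drain(queue, resultsets):
    """Toy broker loop: re-queue jobs whose resultset is still missing."""
    requeued = 0
    while queue:
        job = queue.popleft()
        try:
            store_pulse_job(job, resultsets)
        except MissingResultsetError:
            queue.append(job)
            requeued += 1
            if requeued > 10:  # give up eventually, like max_retries
                break
    return requeued

resultsets = {}
queue = deque([{"revision": "151aaa2db94d", "name": "test-1"}])
first = drain(queue, resultsets)        # push not ingested yet: job keeps requeueing
resultsets["151aaa2db94d"] = []         # the push arrives via json-pushes
second = drain(queue, resultsets)       # job now stores cleanly
```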
Attachment #8794380 - Flags: review?(emorley)
Comment on attachment 8794380 [details] [review] [treeherder] mozilla:requeue-jobs-when-missing-resultset > mozilla:master Made the updates we talked about. Thanks for looking again! The updates are all in a second commit that I'll squash on merge.
Attachment #8794380 - Flags: review?(emorley)
Please drop everything else, and put this in production.
Severity: normal → blocker
Flags: needinfo?(cdawson)
When it's finished the review cycle, yes.
Flags: needinfo?(cdawson)
Comment on attachment 8794380 [details] [review] [treeherder] mozilla:requeue-jobs-when-missing-resultset > mozilla:master r+ with the pytest.raises() changes :-)
Attachment #8794380 - Flags: review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/40d84ee95ddd4d8aacebd1e65536028c2ef22325

Bug 1286578 - Retry job task if resultset doesn't exist (#1872)

This removes the logic which creates "skeleton resultsets" when a job is ingested that we don't have a resultset for yet. The new approach is to fail and wait for the task to retry.

The buildbot job ingestion already skips and retries later if it encounters a job for which it has no resultset. This adds a similar check to the Pulse job ingestion. If a job comes in with a revision that doesn't have a resultset yet, this will raise a ValueError. That will invoke the retryable_task actions, which will wait a bit, then retry. Each time it will wait a little longer. After 9 retries it waits something like 3900 seconds, which should be plenty of time for the resultset ingestion to complete.
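For a feel of the backoff schedule, here is one illustrative formula (not necessarily Treeherder's exact one): doubling from a 15-second base puts the 9th retry's wait near the ~3900 seconds mentioned in the commit message.

```python
# Illustrative exponential backoff; retryable_task's real formula may differ.
def retry_delay(attempt, base=15):
    """Seconds to wait before retry number `attempt` (1-based)."""
    return base * 2 ** (attempt - 1)

schedule = [retry_delay(n) for n in range(1, 10)]
print(schedule)       # [15, 30, 60, 120, 240, 480, 960, 1920, 3840]
print(sum(schedule))  # 7665 seconds of total waiting across all 9 retries
```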
The set of steps that currently have me thinking about closing every Firefox tree until this is deployed are:

* someone pushes a broken chemspill or beta blocker patch to mozilla-release or mozilla-beta, and pastes the push URL in the bug
* a couple hours later, relman checks treeherder for that tree, sees that the tip is green and starred without noticing that it's not that push
* relman hits ShipIt with the cset from the bug, and we ship without ever noticing that failing tests we didn't see were telling us not to

The only thing I can see that would stop that would be if they always, without exception, get the cset they use to ship by copying from Treeherder.
I wonder as a follow-up, whether we should add a new queue for these "waiting for their push" jobs. It's just I got an alert on Heroku stage about queue backlog for the `store_pulse_jobs` queue after this landed. With a separate queue we could set different alert thresholds, plus more easily monitor whether any backlog was due to missing lots of pushes, or just because standard ingestion was getting behind.
This has been deployed to both stage and production now. The rabbitmq queues for SCL3 and Heroku stage are hovering at zero, so they're emptying the whole way, atm. Should we see if those seem to grow or we get more alerts? Not sure if that was a one-time thing.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Depends on: 1306578