Closed Bug 1272532 Opened 6 years ago Closed 6 years ago

exceptions:KeyError: in jobs._load_jobs

Categories

(Tree Management :: Treeherder: Data Ingestion, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: camd, Assigned: wlach)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

This seems to be happening when we get jobs and retries of that job in the same chunk of builds4hr.  This happens on both stage and prod.  For example:

exceptions:KeyError: '64600b5dba1a152c60205c63ea03af8400f9e95d'

The retry job in question that's trying to be set to retried AND completed is here:
https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=a97eec37ec2d4ded719ad996d75c431b7043fa5a&filter-searchStr=44f22fd30969183ab9c21f8ef6d95fb509345b32&selectedJob=9314573

This seems to have started with the change in commit:
https://github.com/mozilla/treeherder/commit/104787282d0db965ea948342ce1f2926936f3d27#diff-a03f6e7c1747a7ef236e211422cc7c5eR1781

The logic changed there with handling retries.  I'm still looking into it.
It's possible we could get away with just changing line 1781 to be:

    if retry_guid_root in jobs_to_update:
        del jobs_to_update[retry_guid_root]

But I don't quite understand why that guid isn't in there.  I *think* it's because the completed and retried are in the same batch of jobs being submitted.  

The logic changed here a bit.  But I think this is the problem:

https://github.com/mozilla/treeherder/commit/104787282d0db965ea948342ce1f2926936f3d27#diff-a03f6e7c1747a7ef236e211422cc7c5eR1764

That line is still referencing ``job_id_lookup`` but in the old code, ``job_id_lookup`` was getting modified in the for-loop.  Now we're modifying ``jobs_to_update``, but still getting the keys from ``job_id_lookup``, so the list of keys may not exist in ``jobs_to_update``

Will: Would you check my reasoning on this?
Flags: needinfo?(wlachance)
I should have mentioned in comment 1, the fix might be just to change:

lookup_keys = job_id_lookup.keys()

to

lookup_keys = jobs_to_update.keys()
Thank you for filing this :-)
(In reply to Cameron Dawson [:camd] from comment #1)
> It's possible we could get away with just changing line 1781 to be:
> 
>     if retry_guid_root in jobs_to_update:
>         del jobs_to_update[retry_guid_root]
> 
> But I don't quite understand why that guid isn't in there.  I *think* it's
> because the completed and retried are in the same batch of jobs being
> submitted.  
> 
> The logic changed here a bit.  But I think this is the problem:
> 
> https://github.com/mozilla/treeherder/commit/
> 104787282d0db965ea948342ce1f2926936f3d27#diff-
> a03f6e7c1747a7ef236e211422cc7c5eR1764
> 
> That line is still referencing ``job_id_lookup`` but in the old code,
> ``job_id_lookup`` was getting modified in the for-loop.  Now we're modifying
> ``jobs_to_update``, but still getting the keys from ``job_id_lookup``, so
> the list of keys may not exist in ``jobs_to_update``
> 
> Will: Would you check my reasoning on this?

This makes sense to me. I would create a unit test that reproduces this problem, then fix it. :)
Flags: needinfo?(wlachance)
Duplicate of this bug: 1274191
Since I caused this problem, it's probably fair that I take responsibility for fixing it.
Assignee: nobody → wlachance
I believe the above PR identifies and fixes the problem. I'm going to postpone asking for review until I've had a chance to give this more thought, but leaving the solution here in case there's an emergency.
(In reply to William Lachance (:wlach) from comment #8)
> I believe the above PR identifies and fixes the problem. I'm going to
> postpone asking for review until I've had a chance to give this more
> thought, but leaving the solution here in case there's an emergency.

For example, I'm somewhat concerned about the possibility of there being two retry guids in the first place -- is there anything we can/should do to try to prevent that from happening? This will probably be easier to fix at the root once we move away from datasource.
Comment on attachment 8754517 [details] [review]
[treeherder] wlach:1272532 > mozilla:master

As discussed, I'm not sure if this 100% fixes the problem but spending more time on this doesn't seem like a good investment (given that we'll be changing the way job ingestion works completely when we move away from datasource)
Attachment #8754517 - Flags: review?(emorley)
Attachment #8754517 - Flags: review?(emorley) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/4d143c055f6ebfec3b2541125c0a45213c3705fb
Bug 1272532 - Handle case where we get two different retry guids for a job

This is an obscure case (that we should probably handle better), but I
believe this patch should at least fix the new relic exception in the
common cases (specifically when we get the same retry with a different
guid in the same batch of jobs). This patch also adds a bunch of unit
tests to test ingesting batches of jobs containing retries. There are
many cases which still fail (which would probably never occur in real
life), but I'm going to put off fixing those for now -- see comments
in the unit tests for details.

https://github.com/mozilla/treeherder/commit/969d063b522ac560e767468ecc4549a9a5cea164
Merge pull request #1507 from wlach/1272532

Bug 1272532 - Handle case where we get two different retry guids for a job
Fixed enough for now.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.