Bug 1164545 (Closed): Opened 9 years ago; Closed 9 years ago

Sporadic incorrect previous/next unclassified failure navigation

Categories: Tree Management :: Treeherder, defect, P1

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: jfrench; Assigned: camd

Attachments: 3 files

KWierso has noticed over the past week or so that, occasionally during 'j,k,n,p' next/prev failure navigation, the page scroll ends up with the selected job outside the page boundary.

I think I saw this once last week also.

Wes has also noted today that his navigation was cycling through a small subset of the visible failed jobs. A page reload fixed it.

This is just a marker bug to keep an eye out for it; I, Wes, or other sheriffs can add a specific Treeherder resultset URL to this bug with a reproducible push, or steps to reproduce.
RyanVM encountered this today on production, and also saw jobs that showed up as failures on a push but appeared as running in the pinboard when he manually pinned them.

Ryan, do you have the URL for that specific push, or was it occurring across all m-c/m-i pushes?
I've seen it on different pushes at this point. Reloading makes the problem go away.
Upping priority to P1, per Ryan's request and today's sheriffing.
Priority: P3 → P1
I can take a look at this in just a bit.
I can reproduce this with a default view of mozilla-inbound, just leisurely navigating around with n,p for a few minutes (perhaps until the next job poll, not sure).

STR:
o load https://treeherder.mozilla.org
o n,p for a few minutes

Expected behavior:
o Unclassified failures get selected

Observed:
o A running job periodically gets selected

I was able to reproduce it both with jobs pinned, and with no jobs pinned and the pinboard closed.
Fwiw one bugfix I see in recent commit history roughly corresponding to this area and timeframe is https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef
(In reply to Jonathan French (:jfrench) from comment #6)
> Fwiw one bugfix I see in recent commit history roughly corresponding to
> this area and timeframe is
> https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef

That's unrelated, changing the guid won't affect the sort order.
Another shot in the dark, but on the -ui side in this timeframe, perhaps touching this area:

o commits in bug 1163496
o changes to repository.js on May 4,5,7
https://github.com/mozilla/treeherder-ui/commits/master/webapp/app/js/models/repository.js

Probably more noise, but in case it helps :)
Depending on how easy it is to repro, it might be worth bisecting locally on ui repo commits :-)
Yup that was the next step for sure. :)

Though, hopefully good news: so far I'm unable to reproduce the bug this Monday morning on production with m-i, m-c, and Try, using Nightly, Chrome, and Release. It was easy for me and Ryan to reproduce on Friday.

Ryan said Tomcat hasn't seen it either today during sheriffing.
I've tested the past 3 mornings and haven't seen it on m-c, m-i, try, etc. since last Friday. RyanVM experienced it this morning on m-c (the same n,p incorrectly selecting a 'running' job), but I was unable to reproduce it.

I spent a good 30-40 min trying, using Nightly, Chrome, signed-in/out, pinning, filters, saving classifications with n,p navigation, and waiting for new job ingestion, but so far it's been working correctly for me. Hopefully we will arrive at a reliable STR.
I'm 99% certain the issue here is with updating existing result sets from resultset_store. I was able to reproduce it by scanning over at least one "unclassified" result set (it turned out it wasn't actually unclassified; it had just been updated to "fail" without its display changing).

One oddity is that it looks like we're not sorting the job list by id, so it's possible that we might slightly change the order of job items as updates come in. I can't think of any cases where this would cause a problem, but hey, maybe I'm missing something:

https://github.com/mozilla/treeherder/blob/master/ui/js/models/resultsets_store.js#L163

(we do various kinds of sorting and categorization after that, yes, but nothing to do specifically with ids)
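
The kind of id sort I mean would look something like this (a sketch; ``jobList`` stands in for whatever array the store actually holds at that point, not the real variable name):

    // Sort incoming job updates by numeric id so their order stays stable
    // across update polls, before the later grouping/categorization passes.
    jobList.sort(function (a, b) {
        return a.id - b.id;
    });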

Our testing of updates in the frontend is pretty much nonexistent. It might be worth adding some functional tests of the various elements of resultsets_store.js, etc. -- making sure that things change as expected when the data model changes underneath.
Assignee: nobody → cdawson
Dang, I'm not able to reproduce this. I'd appreciate it if someone could help me do so. If we can find a time when you can reproduce it, please try this:

Edit clonejobs.js so that the ``selectJob`` function includes ``console.log("selected job", job);``
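
Roughly like this (a sketch; the surrounding function body here is illustrative, not the actual file contents -- only the log line is the suggestion):

    // clonejobs.js -- hypothetical placement inside selectJob
    var selectJob = function (job) {
        console.log("selected job", job);   // dump the full job object
        // ...existing selection/highlight logic continues here...
    };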

I'd be curious what the job looks like, especially in comparison to selecting the same job on a freshly loaded page. Is the job showing as an unclassified failure there? If so, does the console output of the job look the same in both places?

If they do, then we're updating them OK, but probably not properly triggering a re-render in clonejobs for that job's platform row.

Please ping me in channel and we can do some experimenting.  :)
Fwiw: I have a branch that has much better selection logic for those keystrokes. But it sounds like the problem is actually the job not getting updated, not that we're selecting the wrong job?
I'll have a go tomorrow, in case I can still reproduce it.
I hit it again on prod a couple times tonight.
I spent ~1hr this morning trying to repro it on production, stage, and a local gunicorn server pointed at production data, using Nightly and Chrome, with m-i, m-c, and try, default resultsets, get|next up to 120 resultsets, and various sheriff-like workflows, but n,p navigation appears normal for me. I didn't hit any running jobs.

I definitely hit it myself several Fridays ago though.
So this happened for me on a job that started as running against revision 879bc9b3298d but, when completed, changed to revision 7f649ef7ef39 (the next, more recent push). (Build artifacts for running and builds4hr are attached.)

https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=10239430&type=json&name__in=buildapi_running,buildapi_complete
The result this has in the UI is that the job is added to the DOM under one revision while running, but when it completes as failed, it's added to the DOM under the newer revision. It still has the same key<job_id> value in both places in the DOM. The job is still the same; just the result_set_id has changed.

When our code to find the next unclassified failure runs, it looks at each DOM job element, gets the key from the DOM attribute, and then looks that up in the job map inside the ``resultset_store``.

When the job was updated to the new revision, it updated the job map with the new result.  But because it has a new revision, we never knew to re-render the jobs for the OLD revision.  So they stay in the DOM.

When we come to the job in the DOM for the old, running version, it has the same job map key. We get it from the job map, find that its result is failed, and select it.
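
A rough sketch of that scan (the selector, attribute, and map names here are assumptions for illustration, not the actual code; the real check also considers classification state):

    // Walk job elements in DOM order; each carries a key that indexes into
    // the job map held by resultset_store.
    $(".job-btn").each(function () {
        var key = $(this).attr("data-jmkey");   // hypothetical attribute name
        var job = jobMap[key];
        // A stale element left under the OLD revision still resolves to the
        // same key, so the updated (failed) job is found and selected even
        // though this row is still rendered as "running".
        if (job && job.result === "testfailed") {
            selectJob(job);
            return false;   // returning false stops jQuery's .each loop
        }
    });
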
Attached file running artifact
Ed: as far as I know, with coalescing, that only happens on pending jobs.  Once a job is running, it's not going to change the revision it belongs to.  Does this sound like a buildbot bug to you?
Flags: needinfo?(emorley)
Attached file ui PR
This PR will fix it on our end.
Attachment #8612630 - Flags: review?(emorley)
(In reply to Cameron Dawson [:camd] from comment #22)
> Ed: as far as I know, with coalescing, that only happens on pending jobs. 
> Once a job is running, it's not going to change the revision it belongs to. 
> Does this sound like a buildbot bug to you?

A job definitely shouldn't change its revision. The only thing that can change is that a pending job might get marked as coalesced, where the job it was coalesced into is on another revision. But either way, that coalesced job should still be associated with its original revision.
Flags: needinfo?(emorley)
Comment on attachment 8612630 [details] [review]
ui PR

Using CSS classes to choose which job to navigate to seems really hacky. Or is this the way lots of webapps do it? Not sure I feel confident saying this is the best approach; could you ask Will or Mauro instead? :-)

That said, the revision changing is definitely a bug - we need to (a) file a buildbot bug, (b) log/complain about it more loudly during ingestion.
Attachment #8612630 - Flags: review?(emorley)
Depends on: 1169663
Comment on attachment 8612630 [details] [review]
ui PR

Hey Mauro and Will -- wanted to get your feedback on this PR, if you have time. Ed has some concerns with using jQuery to do this. I personally prefer it a lot to what we were doing before, but I'd love to hear your opinions. Thanks! :)
Attachment #8612630 - Flags: review?(mdoglio)
Attachment #8612630 - Flags: feedback?(wlachance)
Bug 1169663 filed against buildbot.
Comment on attachment 8612630 [details] [review]
ui PR

I suppose in theory I prefer an approach which works on the data model level and propagates changes up to the view, but meh... this approach seems pretty comprehensible and I'm a pragmatic person. I'm interested in hearing mdoglio's opinion, but at this point I'm inclined to say "ship it".
Attachment #8612630 - Flags: feedback?(wlachance) → feedback+
This PR will also fix Bug 1169646.
Attachment #8612630 - Flags: review?(mdoglio) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1110559
(In reply to Cameron Dawson [:camd] from comment #20)
> When the job was updated to the new revision, it updated the job map with
> the new result.  But because it has a new revision, we never knew to
> re-render the jobs for the OLD revision.  So they stay in the DOM.
> 
> When we come to the job in the DOM for the old, running version, it has the
> same job map key.  We get it from the job map, and find that it's result is
> failed and select it.

This sounds like another bug we need to fix in the UI - is that correct? :-)
Flags: needinfo?(cdawson)
(In reply to Ed Morley [:emorley] from comment #31)
> (In reply to Cameron Dawson [:camd] from comment #20)
> > When the job was updated to the new revision, it updated the job map with
> > the new result.  But because it has a new revision, we never knew to
> > re-render the jobs for the OLD revision.  So they stay in the DOM.
> > 
> > When we come to the job in the DOM for the old, running version, it has the
> > same job map key.  We get it from the job map, and find that it's result is
> > failed and select it.
> 
> This sounds like another bug we need to fix in the UI - is that correct? :-)

So this is a result of that bad data coming in (the revision changed on a job). It's possible that the UI could detect this, because the job_guid is the same, and attempt to handle it appropriately; presumably it just wouldn't update the revision value for the job. But it's really an upstream data issue.

If we want to handle this contingency of bad data, we should probably handle it in the service instead.  

Opened bug 1171081 to cover that decision and impl.
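
For illustration, a UI-side guard along the lines described above could look like this (function and field names are assumptions; whether this belongs in the UI or the service is exactly what bug 1171081 will decide):

    // If an update arrives for a job_guid we already track but with a
    // different result_set_id, keep the original revision rather than
    // silently moving the job to the newer push.
    function applyJobUpdate(existing, update) {
        if (update.result_set_id !== existing.result_set_id) {
            update.result_set_id = existing.result_set_id;
        }
        for (var key in update) {
            if (update.hasOwnProperty(key)) {
                existing[key] = update[key];
            }
        }
        return existing;
    }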
Flags: needinfo?(cdawson)
No longer depends on: 1169663
Depends on: 782874