Closed Bug 1164545 Opened 10 years ago Closed 10 years ago
Sporadic incorrect previous/next unclassified failure navigation
Categories: Tree Management :: Treeherder, defect, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jfrench, Assigned: camd
Attachments: 3 files
KWierso has noticed over the past week or so that, occasionally during 'j,k,n,p' next/prev failure navigation, the page scroll ends up with the selected job outside the page boundary.
I think I saw this once last week as well.
Wes also noted today that his navigation was cycling through a loop of a tiny subset of the visible failed jobs. A page reload fixed it.
This is just a marker bug to keep an eye out for the issue; I, Wes, or other sheriffs can add a specific Treeherder resultset URL with a reproducible push, or steps to reproduce.
Reporter
Comment 1•10 years ago
RyanVM encountered this today on production, and also saw jobs that showed up as failures on a push but appeared as running in the pinboard when he manually pinned them.
Ryan, do you have the URL for that specific push, or was it occurring across all m-c/m-i pushes?
Comment 2•10 years ago
I've seen it on different pushes at this point. Reloading makes the problem go away.
Reporter
Comment 3•10 years ago
Upping priority to P1, per Ryan's request while sheriffing today.
Priority: P3 → P1
Assignee
Comment 4•10 years ago
I can take a look at this in just a bit.
Reporter
Comment 5•10 years ago
I can reproduce this with a default view of mozilla-inbound, just leisurely navigating around with n,p for a few minutes (perhaps until the next job poll, not sure).
STR:
o load https://treeherder.mozilla.org
o n,p for a few minutes
Expected behavior:
o Unclassified failures get selected
Observed:
o A running job periodically gets selected
I was able to reproduce it both with jobs pinned, and no jobs pinned and the pinboard closed.
Reporter
Comment 6•10 years ago
Fwiw one bugfix I see in recent commit history roughly corresponding to this area and timeframe is https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef
Comment 7•10 years ago
(In reply to Jonathan French (:jfrench) from comment #6)
> Fwiw one bugfix I see in recent commit history roughly corresponding to this
> area and timeframe is
> https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef
That's unrelated, changing the guid won't affect the sort order.
Reporter
Comment 8•10 years ago
Another shot in the dark, but on the -ui side in this timeframe, these changes perhaps touch this area:
o commits in bug 1163496
o changes to repository.js on May 4,5,7
https://github.com/mozilla/treeherder-ui/commits/master/webapp/app/js/models/repository.js
Probably more noise, but in case it helps :)
Comment 9•10 years ago
Depending on how easy it is to repro, it might be worth bisecting locally on ui repo commits :-)
Reporter
Comment 10•10 years ago
Yup that was the next step for sure. :)
Though hopefully good news: I'm so far unable to reproduce the bug this Monday morning on production with m-i, m-c, and Try, using Nightly, Chrome, and Release. It was easy to reproduce on Friday for myself and Ryan.
Ryan said Tomcat hasn't seen it either today during sheriffing.
Reporter
Comment 11•10 years ago
I've tested the past 3 mornings and haven't seen it on m-c, m-i, try, etc. since last Friday. RyanVM experienced it this morning on m-c (the same n,p navigation incorrectly selecting a 'running' job), but I was unable to reproduce it.
I spent a good 30-40 min trying, using Nightly and Chrome, signed-in/out, pinning, filters, and saving classifications with n,p navigation while waiting for new job ingestion, but so far it's been working correctly for me. Hopefully we will arrive at a reliable STR.
Comment 12•10 years ago
I'm 99% certain the issue here is with updating existing result sets from resultset_store. I was able to reproduce it while scanning over at least one "unclassified" result set (it turns out it wasn't actually unclassified; it had just been updated to "fail" without its display changing).
One oddity is that it looks like we're not sorting the job list by id, so it's possible that we might slightly change the order of job items as updates come in. I can't think of any cases where this would cause a problem, but hey, maybe I'm missing something:
https://github.com/mozilla/treeherder/blob/master/ui/js/models/resultsets_store.js#L163
(we do various kinds of sorting and categorization after that, yes, but nothing to do specifically with ids)
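For illustration, keeping that order stable by job id could be as simple as something like this (a sketch against an assumed flattened jobList array, not the actual store code):
  // Hypothetical sketch: sort the flattened job list by id so the order
  // stays stable as updates come in (jobList is an assumed plain array of jobs).
  jobList.sort(function (a, b) {
      return a.id - b.id;
  });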
Our testing of updates in the frontend is pretty much nonexistent. It might be worth adding some functional tests of the various elements of resultsets_store.js, etc. -- making sure that things change as expected when the data model changes underneath.
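As a rough idea of the kind of functional test this could mean (a Jasmine-style sketch; makeStoreWithJob, updateJob, and getJob are hypothetical helpers, not the real store API):
  describe('resultsets_store updates', function () {
      it('keeps a completed job on its original push', function () {
          // hypothetical fixture helper that seeds the store with one running job
          var store = makeStoreWithJob({ id: 1, state: 'running', result_set_id: 10 });
          // simulate the poller delivering the completed, failed version of the job
          store.updateJob({ id: 1, state: 'completed', result: 'testfailed', result_set_id: 10 });
          expect(store.getJob(1).result).toBe('testfailed');
          expect(store.getJob(1).result_set_id).toBe(10);
      });
  });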
Assignee
Updated•10 years ago
Assignee: nobody → cdawson
Assignee
Comment 13•10 years ago
Dang, I'm not able to reproduce this. I'd appreciate it if someone could help me do so. If you can find a time when you can reproduce it, then please try this:
Edit clonejobs.js so that the ``selectJob`` function has ``console.log("selected job", job);``
I'd be curious what the job looks like, especially in comparison to selecting the same job on a freshly loaded page. Is the job showing as an unclassified failure there? If so, does the console print of the job look the same in both places?
If they do, then we're updating the jobs OK, but probably not properly triggering a re-render in clonejobs for that job's platform row.
Please ping me in channel and we can do some experimenting. :)
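For reference, the diagnostic described above might look roughly like this inside clonejobs.js (a sketch; the rest of the selectJob body is omitted):
  var selectJob = function (job) {
      // temporary diagnostic: dump the job object being selected so it can be
      // compared against the same job on a freshly loaded page
      console.log("selected job", job);
      // ... existing selection / scrolling logic ...
  };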
Assignee
Comment 14•10 years ago
Fwiw: I have a branch that has much better selection logic for those keystrokes. But it sounds like the problem is actually the job not getting updated, not that we're selecting the wrong job?
Reporter
Comment 15•10 years ago
I'll have a go tomorrow, in case I can still reproduce it.
I hit it again on prod a couple times tonight.
Reporter
Comment 17•10 years ago
I spent ~1hr trying to repro it this morning, on production, stage, and with a local gunicorn server pointing to production data, using Nightly and Chrome, with m-i, m-c, try, default resultsets, get|next up to 120 resultsets, and various sheriff-like workflows, but n,p navigation appears normal for me this morning. I didn't hit any running jobs.
I definitely hit it myself several Fridays ago though.
Assignee
Comment 18•10 years ago
Assignee
Comment 19•10 years ago
So this happened for me on a job that started as running against revision 879bc9b3298d but, when completed, changed to revision 7f649ef7ef39 (the next more recent push). (Build artifacts for running and builds4hr are attached.)
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=10239430&type=json&name__in=buildapi_running,buildapi_complete
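For reference, those two artifacts can be pulled from the browser console with something like this (a sketch using fetch; no assumptions made about the payload shape beyond what the URL above returns):
  fetch('https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/' +
        '?job_id=10239430&type=json&name__in=buildapi_running,buildapi_complete')
      .then(function (resp) { return resp.json(); })
      .then(function (artifacts) { console.log(artifacts); });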
Assignee
Comment 20•10 years ago
The result this has in the UI is that the job is added to the DOM under one revision while running, but when it completes as failed, it's added to the DOM under the newer revision. It still has the same key<job_id> value in both places in the DOM. The job is still the same; just the result_set_id has changed.
When our code to find the next unclassified goes through, it's looking at each DOM job object, getting the key from the DOM attribute and then looking that up in the job map inside the ``resultset_store``.
When the job was updated to the new revision, it updated the job map with the new result. But because it has a new revision, we never knew to re-render the jobs for the OLD revision. So they stay in the DOM.
When we come to the job in the DOM for the old, running version, it has the same job map key. We get it from the job map, and find that its result is failed and select it.
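Roughly, the failure mode looks like this (an illustrative sketch with hypothetical selectors and attribute/map names, not the actual clonejobs code):
  // Walk the job buttons currently in the DOM; a stale button under the OLD
  // revision still carries the same key, so the map lookup returns the
  // updated (failed) job and it gets selected.
  $('.job-btn').each(function () {
      var key = $(this).attr('data-jmkey');   // hypothetical key attribute
      var job = jobMap[key];                  // map already updated to result: "testfailed"
      if (job.result === 'testfailed' && !job.failure_classification_id) {
          selectJob(job);                     // selects a job whose DOM node is stale
          return false;                       // stop the .each() loop
      }
  });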
Assignee
Comment 21•10 years ago
Assignee
Comment 22•10 years ago
Ed: as far as I know, with coalescing, that only happens on pending jobs. Once a job is running, it's not going to change the revision it belongs to. Does this sound like a buildbot bug to you?
Flags: needinfo?(emorley)
Assignee
Comment 23•10 years ago
This PR will fix it on our end.
Attachment #8612630 - Flags: review?(emorley)
Comment 24•10 years ago
(In reply to Cameron Dawson [:camd] from comment #22)
> Ed: as far as I know, with coalescing, that only happens on pending jobs.
> Once a job is running, it's not going to change the revision it belongs to.
> Does this sound like a buildbot bug to you?
A job definitely shouldn't change its revision - the only thing that can change is that a pending job might get marked as coalesced, with the job it was coalesced into being on another revision. But either way, that coalesced job should still be associated with its original revision.
Flags: needinfo?(emorley)
Comment 25•10 years ago
Comment on attachment 8612630 [details] [review]
ui PR
Using CSS classes to choose which job to navigate to seems really hacky. Or is this the way lots of webapps do it? Not sure I feel confident saying this is the best approach; could you ask Will or Mauro instead? :-)
That said, the revision changing is definitely a bug - we need to (a) file a buildbot bug, (b) log/complain about it more loudly during ingestion.
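For context, the general pattern being discussed - navigating over jobs via CSS classes and jQuery rather than through the data model - would look something like this (a sketch with hypothetical class names, not the actual PR code):
  // Tag unclassified failures with a class at render time, then let the
  // keyboard shortcuts navigate over whatever is actually visible in the DOM.
  function selectNextUnclassifiedFailure() {
      var visible = $('.job-btn.unclassified-failure:visible');  // hypothetical classes
      var current = visible.index($('.job-btn.selected'));
      var next = visible.eq((current + 1) % visible.length);
      next.click();
  }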
Attachment #8612630 - Flags: review?(emorley)
Assignee
Comment 26•10 years ago
Comment on attachment 8612630 [details] [review]
ui PR
Hey Mauro and Will -- I wanted to get your feedback on this PR, if you have time. Ed has some concerns with using jQuery to do this. I personally prefer it a lot to what we were doing before, but I'd love to hear your opinions. Thanks! :)
Attachment #8612630 - Flags: review?(mdoglio)
Attachment #8612630 - Flags: feedback?(wlachance)
Assignee
Comment 27•10 years ago
Bug 1169663 filed against buildbot.
Comment 28•10 years ago
Comment on attachment 8612630 [details] [review]
ui PR
I suppose in theory I prefer an approach which works on the data model level and propagates changes up to the view, but meh... this approach seems pretty comprehensible and I'm a pragmatic person. I'm interested in hearing mdoglio's opinion, but at this point I'm inclined to say "ship it".
Attachment #8612630 - Flags: feedback?(wlachance) → feedback+
Assignee
Comment 29•10 years ago
This PR will also fix Bug 1169646
Updated•10 years ago
Attachment #8612630 - Flags: review?(mdoglio) → review+
Comment 30•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/20f14b8099bb945c3ab9599a8544ac7afc132177
Bug 1164545 - Sporadic incorrect previous/next unclassified
Assignee
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 31•10 years ago
(In reply to Cameron Dawson [:camd] from comment #20)
> When the job was updated to the new revision, it updated the job map with
> the new result. But because it has a new revision, we never knew to
> re-render the jobs for the OLD revision. So they stay in the DOM.
>
> When we come to the job in the DOM for the old, running version, it has the
> same job map key. We get it from the job map, and find that its result is
> failed and select it.
This sounds like another bug we need to fix in the UI - is that correct? :-)
Updated•10 years ago
Flags: needinfo?(cdawson)
Assignee
Comment 32•10 years ago
(In reply to Ed Morley [:emorley] from comment #31)
> (In reply to Cameron Dawson [:camd] from comment #20)
> > When the job was updated to the new revision, it updated the job map with
> > the new result. But because it has a new revision, we never knew to
> > re-render the jobs for the OLD revision. So they stay in the DOM.
> >
> > When we come to the job in the DOM for the old, running version, it has the
> > same job map key. We get it from the job map, and find that its result is
> > failed and select it.
>
> This sounds like another bug we need to fix in the UI - is that correct? :-)
So this is a result of that bad data coming in (revision has changed on a job). It's possible that the UI could detect this because the job_guid is the same and attempt to handle it appropriately. Presumably it just wouldn't update the revision value for the job. But it's really an upstream data issue.
If we want to handle this contingency of bad data, we should probably handle it in the service instead.
Opened bug 1171081 to cover that decision and impl.
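If the UI ever does grow a guard for this, it could be as simple as something like the following (a sketch, not the shipped fix; mergeJobUpdate and the surrounding store plumbing are assumptions, while job_guid and result_set_id are the real field names mentioned above):
  // Hypothetical defensive merge: if an update arrives for a job_guid we already
  // track but with a different result_set_id, keep the original push rather than
  // silently moving the job to a new revision.
  function mergeJobUpdate(jobMap, update) {
      var existing = jobMap[update.job_guid];
      if (existing && existing.result_set_id !== update.result_set_id) {
          update.result_set_id = existing.result_set_id;
      }
      jobMap[update.job_guid] = update;
  }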
Flags: needinfo?(cdawson)