Bug 1164545 (Closed): Opened 9 years ago; Closed 9 years ago

Sporadic incorrect previous/next unclassified failure navigation

Categories: Tree Management :: Treeherder, defect, P1

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: jfrench; Assigned: camd

Attachments: 3 files

KWierso has noticed over the past week or so that, occasionally during 'j,k,n,p' next/prev failure navigation, the page scroll ends up with the selected job outside the page boundary.

I think I saw this once last week also.

Wes has also noted today that his navigation was cycling through a small subset of the visible failed jobs. A page reload fixed it.

This is just a marker bug to keep an eye out for it; I, Wes, or other sheriffs can add a specific Treeherder resultset URL to this bug with a reproducible push, or steps to reproduce.
RyanVM encountered this today on production, and also saw jobs that showed up as failures on a push but appeared as running in the pinboard when he manually pinned them.

Ryan, do you have the URL for that specific push, or was it occurring across all m-c/m-i pushes?
I've seen it on different pushes at this point. Reloading makes the problem go away.
Upping priority to P1, per Ryan's request and today's sheriffing.
Priority: P3 → P1
I can take a look at this in just a bit.
I can reproduce this with a default view of mozilla-inbound, just leisurely navigating around with n,p for a few minutes (perhaps until the next job poll, not sure).

STR:
o load https://treeherder.mozilla.org
o n,p for a few minutes

Expected behavior:
o Unclassified failures get selected

Observed:
o A running job periodically gets selected

I was able to reproduce it both with jobs pinned, and with no jobs pinned and the pinboard closed.
Fwiw one bugfix I see in recent commit history roughly corresponding to this area and timeframe is https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef
(In reply to Jonathan French (:jfrench) from comment #6)
> Fwiw one bugfix I see in recent commit history roughly corresponding to
> this area and timeframe is
> https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef

That's unrelated, changing the guid won't affect the sort order.
Another shot in the dark, but on the -ui side in this timeframe, perhaps touching this area:

o commits in bug 1163496
o changes to repository.js on May 4,5,7
https://github.com/mozilla/treeherder-ui/commits/master/webapp/app/js/models/repository.js

Probably more noise, but in case it helps :)
Depending on how easy it is to repro, it might be worth bisecting locally on ui repo commits :-)
Yup that was the next step for sure. :)

Though, hopefully good news: so far I'm unable to reproduce the bug this Monday morning on production with m-i, m-c, and Try, using Nightly, Chrome, and Release. It was easy for me and Ryan to reproduce on Friday.

Ryan said Tomcat hasn't seen it either today during sheriffing.
I've tested the past 3 mornings and haven't seen it on m-c, m-i, try, etc. since last Friday. RyanVM experienced it this morning on m-c (the same n,p incorrectly selecting a 'running' job), but I was unable to reproduce it.

I spent a good 30-40 min trying, using Nightly, Chrome, signed-in/out, pinning, filters, saving classifications with n,p navigation, and waiting for new job ingestion, but so far it's been working correctly for me. Hopefully we will arrive at a reliable STR.
I'm 99% certain the issue here is with updating existing result sets from resultset_store. I was able to reproduce it by scanning over at least one "unclassified" result set (it turned out it wasn't actually unclassified; it had just been updated to "fail" without its display changing).

One oddity is that it looks like we're not sorting the job list by id, so it's possible that we might slightly change the order of job items as updates come in. I can't think of any cases where this would cause a problem, but hey, maybe I'm missing something:

https://github.com/mozilla/treeherder/blob/master/ui/js/models/resultsets_store.js#L163

(we do various kinds of sorting and categorization after that, yes, but nothing to do specifically with ids)
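
The kind of id sort I mean would look something like this (a sketch; ``jobList`` stands in for whatever array the store actually holds at that point, not the real variable name):

    // Sort incoming job updates by numeric id so their order stays stable
    // across update polls, before the later grouping/categorization passes.
    jobList.sort(function (a, b) {
        return a.id - b.id;
    });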

Our testing of updates in the frontend is pretty much nonexistent. It might be worth adding some functional tests of the various elements of resultsets_store.js, etc. -- making sure that things change as expected when the data model changes underneath.
Assignee: nobody → cdawson
Dang, I'm not able to reproduce this. I'd appreciate it if someone could help me do so. If we can find a time when you can reproduce it, please try this:

Edit clonejobs.js so that the ``selectJob`` function includes ``console.log("selected job", job);``
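
Roughly like this (a sketch; the surrounding function body here is illustrative, not the actual file contents -- only the log line is the suggestion):

    // clonejobs.js -- hypothetical placement inside selectJob
    var selectJob = function (job) {
        console.log("selected job", job);   // dump the full job object
        // ...existing selection/highlight logic continues here...
    };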

I'd be curious what the job looks like, especially in comparison to selecting the same job on a freshly loaded page. Is the job showing as an unclassified failure there? If so, does the console output of the job look the same in both places?

If they do, then we're updating them OK, but probably not properly triggering a re-render in clonejobs for that job's platform row.

Please ping me in channel and we can do some experimenting.  :)
Fwiw: I have a branch that has much better selection logic for those keystrokes. But it sounds like the problem is actually the job not getting updated, not that we're selecting the wrong job?
I'll have a go tomorrow, in case I can still reproduce it.
I hit it again on prod a couple times tonight.
I spent ~1hr this morning trying to repro it on production, stage, and a local gunicorn server pointed at production data, using Nightly and Chrome, with m-i, m-c, and try, default resultsets, get|next up to 120 resultsets, and various sheriff-like workflows, but n,p navigation appears normal for me. I didn't hit any running jobs.

I definitely hit it myself several Fridays ago though.
So this happened for me on a job that started as running against revision 879bc9b3298d but, when completed, changed to revision 7f649ef7ef39 (the next, more recent push). (Build artifacts for running and builds4hr are attached.)

https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=10239430&type=json&name__in=buildapi_running,buildapi_complete
The result this has in the UI is that the job is added to the DOM under one revision while running, but when it completes as failed, it's added to the DOM under the newer revision. It still has the same key<job_id> value in both places in the DOM. The job is still the same; just the result_set_id has changed.

When our code to find the next unclassified failure runs, it looks at each DOM job element, gets the key from the DOM attribute, and then looks that up in the job map inside the ``resultset_store``.

When the job was updated to the new revision, it updated the job map with the new result.  But because it has a new revision, we never knew to re-render the jobs for the OLD revision.  So they stay in the DOM.

When we come to the job in the DOM for the old, running version, it has the same job map key. We get it from the job map, find that its result is failed, and select it.
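
A rough sketch of that scan (the selector, attribute, and map names here are assumptions for illustration, not the actual code; the real check also considers classification state):

    // Walk job elements in DOM order; each carries a key that indexes into
    // the job map held by resultset_store.
    $(".job-btn").each(function () {
        var key = $(this).attr("data-jmkey");   // hypothetical attribute name
        var job = jobMap[key];
        // A stale element left under the OLD revision still resolves to the
        // same key, so the updated (failed) job is found and selected even
        // though this row is still rendered as "running".
        if (job && job.result === "testfailed") {
            selectJob(job);
            return false;   // returning false stops jQuery's .each loop
        }
    });
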
Attached file running artifact
Ed: as far as I know, with coalescing, that only happens on pending jobs.  Once a job is running, it's not going to change the revision it belongs to.  Does this sound like a buildbot bug to you?
Flags: needinfo?(emorley)
Attached file ui PR
This PR will fix it on our end.
Attachment #8612630 - Flags: review?(emorley)
(In reply to Cameron Dawson [:camd] from comment #22)
> Ed: as far as I know, with coalescing, that only happens on pending jobs. 
> Once a job is running, it's not going to change the revision it belongs to. 
> Does this sound like a buildbot bug to you?

A job definitely shouldn't change its revision. The only thing that can change is that a pending job might get marked as coalesced, where the job it was coalesced into is on another revision. But either way, that coalesced job should still be associated with its original revision.
Flags: needinfo?(emorley)
Comment on attachment 8612630 [details] [review]
ui PR

Using CSS classes to choose which job to navigate to seems really hacky. Or is this the way lots of webapps do it? Not sure I feel confident saying this is the best approach; could you ask Will or Mauro instead? :-)

That said, the revision changing is definitely a bug - we need to (a) file a buildbot bug, (b) log/complain about it more loudly during ingestion.
Attachment #8612630 - Flags: review?(emorley)
Depends on: 1169663
Comment on attachment 8612630 [details] [review]
ui PR

Hey Mauro and Will -- wanted to get your feedback on this PR, if you have time. Ed has some concerns with using jQuery to do this. I personally prefer it a lot to what we were doing before, but I'd love to hear your opinions. Thanks! :)
Attachment #8612630 - Flags: review?(mdoglio)
Attachment #8612630 - Flags: feedback?(wlachance)
Bug 1169663 filed against buildbot.
Comment on attachment 8612630 [details] [review]
ui PR

I suppose in theory I prefer an approach which works on the data model level and propagates changes up to the view, but meh... this approach seems pretty comprehensible and I'm a pragmatic person. I'm interested in hearing mdoglio's opinion, but at this point I'm inclined to say "ship it".
Attachment #8612630 - Flags: feedback?(wlachance) → feedback+
This PR will also fix Bug 1169646.
Attachment #8612630 - Flags: review?(mdoglio) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1110559
(In reply to Cameron Dawson [:camd] from comment #20)
> When the job was updated to the new revision, it updated the job map with
> the new result.  But because it has a new revision, we never knew to
> re-render the jobs for the OLD revision.  So they stay in the DOM.
> 
> When we come to the job in the DOM for the old, running version, it has the
> same job map key.  We get it from the job map, and find that it's result is
> failed and select it.

This sounds like another bug we need to fix in the UI - is that correct? :-)
Flags: needinfo?(cdawson)
(In reply to Ed Morley [:emorley] from comment #31)
> (In reply to Cameron Dawson [:camd] from comment #20)
> > When the job was updated to the new revision, it updated the job map with
> > the new result.  But because it has a new revision, we never knew to
> > re-render the jobs for the OLD revision.  So they stay in the DOM.
> > 
> > When we come to the job in the DOM for the old, running version, it has the
> > same job map key.  We get it from the job map, and find that it's result is
> > failed and select it.
> 
> This sounds like another bug we need to fix in the UI - is that correct? :-)

So this is a result of that bad data coming in (the revision changed on a job). It's possible that the UI could detect this, because the job_guid is the same, and attempt to handle it appropriately; presumably it just wouldn't update the revision value for the job. But it's really an upstream data issue.

If we want to handle this contingency of bad data, we should probably handle it in the service instead.  

Opened bug 1171081 to cover that decision and impl.
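
For illustration, a UI-side guard along the lines described above could look like this (function and field names are assumptions; whether this belongs in the UI or the service is exactly what bug 1171081 will decide):

    // If an update arrives for a job_guid we already track but with a
    // different result_set_id, keep the original revision rather than
    // silently moving the job to the newer push.
    function applyJobUpdate(existing, update) {
        if (update.result_set_id !== existing.result_set_id) {
            update.result_set_id = existing.result_set_id;
        }
        for (var key in update) {
            if (update.hasOwnProperty(key)) {
                existing[key] = update[key];
            }
        }
        return existing;
    }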
Flags: needinfo?(cdawson)
No longer depends on: 1169663
Depends on: 782874