791385 - Add an hg hook to the try server in order to prevent having multiple try jobs in flight by default

Reporter

Description

•

13 years ago

We are interested in doing this in order to cut down the infrastructure load caused by people pushing multiple patches to try without realizing that their previous jobs are still active. Chris has written the hook. mconnor is supposed to write the announcement to dev.something.
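The attached hook itself is not reproduced in this thread, so the following is only a minimal sketch of the decision it is described as making: reject a push while the same user still has jobs in flight, unless the push opts out. The names `check_push`, `PushDecision`, `active_jobs_by_user`, and the `OVERRIDE_TOKEN` value are all hypothetical; later comments only say that a "token" exists, not what it looks like.

```python
from dataclasses import dataclass

# Hypothetical override marker. The thread mentions a token but never shows it.
OVERRIDE_TOKEN = "--allow-multiple"


@dataclass
class PushDecision:
    allowed: bool
    message: str


def check_push(user, commit_message, active_jobs_by_user):
    """Decide whether a try push should be accepted.

    active_jobs_by_user maps a user name to the number of that user's
    try jobs that are still pending or running (assumed to be supplied
    by the scheduler; how it is obtained is outside this sketch).
    """
    active = active_jobs_by_user.get(user, 0)
    if active == 0:
        return PushDecision(True, "no jobs in flight")
    if OVERRIDE_TOKEN in commit_message:
        # The developer explicitly asked for a second concurrent run.
        return PushDecision(
            True, f"override token present; {active} job(s) still running"
        )
    return PushDecision(
        False,
        f"{user} has {active} job(s) in flight; cancel them or add "
        f"{OVERRIDE_TOKEN} to the commit message",
    )
```

A real Mercurial hook would wrap a check like this and abort the transaction when `allowed` is false; that wiring is deliberately left out here.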

Chris AtLee [:catlee]

Comment 1

•

13 years ago

Attached file le hook — Details

Chris AtLee [:catlee]

Updated

•

13 years ago

Attachment #661385 - Attachment mime type: text/x-python → text/plain

Phil Ringnalda (:philor)

Comment 2

•

13 years ago

What problem are we solving here? I can see it as a clear solution to "we have 280 developers with try access, and 280 machines, so if you push twice you are taking someone else's machine and you need to realize it" but that doesn't strike me as being a problem that we have. We certainly have problems like "we have a shocking number of people with try access who do not know that they can retrigger tests, who believe that the way to retrigger an intermittent failure is to push again," but this doesn't solve that, it just makes them wait, or qref and not wait. Are we solving "people push, see the result from the first platform, realize they need to change foo, and push with foo changed without killing the first push"? This seems like a massive annoyance for the normal case ("I'm working on seven separate patches, three of which require Windows coverage, I pushed two of them three hours ago and I'm still waiting on Windows tests") in order to solve that. I don't watch Try very closely, what percentage of the load is from those cases?

(no longer active)

Reporter

Comment 3

•

13 years ago

(In reply to comment #2)
> Are we solving "people push, see the result from the first platform, realize
> they need to change foo, and push with foo changed without killing the first
> push"? This seems like a massive annoyance for the normal case ("I'm working on
> seven separate patches, three of which require Windows coverage, I pushed two
> of them three hours ago and I'm still waiting on Windows tests") in order to
> solve that. I don't watch Try very closely, what percentage of the load is from
> those cases?

This is the problem that we're trying to solve here. For the people who use try the most, the case you described does not seem to be the normal case at all, after eyeballing the try load for a time (not that I have more concrete evidence than that, of course.) Note that this will also serve to make people realize that there is a cost incurred on our shared infrastructure when they push something to the try server.

Mike Connor [:mconnor]

Comment 4

•

13 years ago

I don't think the normal case is "I'm constantly context switching between lots of different patches." I think the normal case I've observed is "My try run was unsuccessful, I should push a new version with fixes." Ideally this would be an interactive hook, but hg makes that hard/impossible.

Phil Ringnalda (:philor)

Comment 5

•

13 years ago

Are we going to alter the trychooser hg extension to totally defeat this hook by automatically repushing with the token, or are we going to make the extension prohibitively annoying to use if you are not wasteful of your time, and work on more than one patch at a time?

Mike Connor [:mconnor]

Comment 6

•

13 years ago

So the primary question, as with all bugs, is what is the outcome we want? My assertion is that we want to reduce unnecessary load on our infrastructure, where unnecessary is defined as "jobs that the developer does not actually need." The goal is not "create an arbitrary hoop to jump through" but "do the thing that makes sense for the situation."

In this case, where we are trying to raise/force awareness is "re-pushing the same patch queue with changes without killing the now-obsolete jobs" and not "pushing multiple separate patch queues at once." I've seen relatively little evidence of the latter, and more of the former, but that's not really data, or even relevant. If our solution impairs the latter, that's a cost we should seek to remove.

An alternative approach would be to have trychooser automatically generate a per-repo/branch/patchqueue token (which doesn't need to be universally unique, just per-user unique), and treat each user/token pair as a job key. If a user re-pushes with the same token (implying that they're building the same tree), we'd kill existing jobs with that key and start a new run. If specific platforms are specified in the second push, we should optimize and only kill jobs for those platforms (i.e. a re-push just for Mac would not kill previous Windows jobs). This would mean devs would be able to run multiple unrelated runs with little effort, but we would catch the "new push obsoletes previous push" cases automatically and free up resources without relying on devs to think it through.
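The user/token job key proposed above can be sketched roughly as follows. This is an illustration under assumptions, not an existing trychooser or scheduler API: the `Job` record, the `supersede` function, and the convention that an empty platform set means "all platforms" are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    user: str
    token: str      # per-user patch queue token (hypothetical)
    platform: str
    cancelled: bool = False


def supersede(jobs, user, token, platforms=frozenset()):
    """Cancel the jobs that a new push with the same user/token replaces.

    If `platforms` is empty, the new push is assumed to cover every
    platform, so all live jobs with the same key are cancelled.
    Otherwise only jobs on the listed platforms are cancelled.
    Returns the list of jobs cancelled by this call.
    """
    cancelled = []
    for job in jobs:
        if job.cancelled or (job.user, job.token) != (user, token):
            continue  # different key, or already dead
        if platforms and job.platform not in platforms:
            continue  # e.g. a Mac-only re-push leaves Windows jobs alone
        job.cancelled = True
        cancelled.append(job)
    return cancelled
```

Pushes with a different token, such as a second unrelated patch queue from the same developer, never match the key and so are left running.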

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 7

•

13 years ago

(In reply to Mike Connor [:mconnor] from comment #6)
> In this case, where we are trying to raise/force awareness is "re-pushing
> the same patch queue with changes without killing the now-obsolete jobs" and
> not "pushing multiple separate patch queues at once."

So why not just raise awareness, by producing a report on who's doing the former (the most) and contacting them?
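One rough way such a report could be computed, assuming a push log with start and end times per user is available (the tuple layout and the function name `overlapping_push_counts` are hypothetical): count, per user, the pushes that began while an earlier push by the same user was still running. Without a per-queue token this cannot tell a revised patch from an unrelated one, so it only approximates the behaviour being discussed.

```python
from collections import Counter


def overlapping_push_counts(pushes):
    """Count pushes made while the same user's earlier push was still running.

    pushes: iterable of (user, start, end) tuples with comparable times.
    Returns a Counter mapping user -> number of overlapping pushes.
    """
    counts = Counter()
    latest_end = {}  # user -> latest finish time seen so far
    for user, start, end in sorted(pushes, key=lambda p: p[1]):
        if user in latest_end and start < latest_end[user]:
            counts[user] += 1
        latest_end[user] = max(end, latest_end.get(user, end))
    return counts
```

`counts.most_common()` then lists the users with the most overlapping pushes first.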

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 8

•

13 years ago

This is just pointless annoyance for people who work on more than one patch at a time... Do you think the people who don't even know their LDAP password will bother trying to get it reset because of this hook, if they didn't before?

(no longer active)

Reporter

Comment 9

•

13 years ago

(In reply to comment #8)
> This is just pointless annoyance for people who work on more than one patch at
> a time... Do you think the people who don't even know their LDAP password will
> bother trying to get it reset because of this hook, if they didn't before?

Hmm, I'm not sure what the LDAP password has to do with what's being suggested here!

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 10

•

13 years ago

You can't cancel builds without one

Mike Connor [:mconnor]

Comment 11

•

13 years ago

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) (offline September 29-30 NZ time, i.e. 28-29 US time) from comment #7)
> (In reply to Mike Connor [:mconnor] from comment #6)
> > In this case, where we are trying to raise/force awareness is "re-pushing
> > the same patch queue with changes without killing the now-obsolete jobs" and
> > not "pushing multiple separate patch queues at once."
>
> So why not just raise awareness, by producing a report on who's doing the
> former (the most) and contacting them?

People have been trying to raise awareness for years, with limited success. It's not going to be a one-off process (new people always starting) or usefully effective (usage follows a long tail pattern, and I don't think worst offenders are necessarily the lion's share of unnecessary load). We can/will do this (catlee has a report) but I think that automagically handling things for people feels like a win.

Anthony Jones (:ajones, :kentuckyfriedtakahe, :k17e)

Comment 12

•

13 years ago

(In reply to Ehsan Akhgari [:ehsan] from comment #0)
> We are interested in doing this in order to cut down the infrastructure load
> caused by people pushing multiple patches to try without realizing that
> their previous jobs are still active.

If the assumption is that people are doing this without realising perhaps a better solution would be to dispatch an email notification.

(no longer active)

Reporter

Comment 13

•

13 years ago

(In reply to comment #12)
> (In reply to Ehsan Akhgari [:ehsan] from comment #0)
> > We are interested in doing this in order to cut down the infrastructure load
> > caused by people pushing multiple patches to try without realizing that
> > their previous jobs are still active.
>
> If the assumption is that people are doing this without realising perhaps a
> better solution would be to dispatch an email notification.

People already get emails from try. Clearly that is not working!

Phil Ringnalda (:philor)

Comment 14

•

13 years ago

Actually, they don't get emails from try. We changed the default to no email, which was a mistake that results in total failure pushes continuing to run until everything has failed, because "everyone" had already filtered all email from try to the trash. So step 1 in any solution that involves email is to send it from a new address.

Chris AtLee [:catlee]

Updated

•

13 years ago

Assignee: catlee → nobody

Priority: -- → P4

Chris Cooper [:coop] (he/him)

Comment 15

•

13 years ago

Any resolution here? Do we need to create/promote/enforce a set of try best practices?

Component: Release Engineering → Release Engineering: Automation (General)

OS: Mac OS X → All

QA Contact: catlee

Hardware: x86 → All

Whiteboard: [tryserver][capacity][hg][hook]

Phil Ringnalda (:philor)

Comment 16

•

13 years ago

My impression from a brief conversation with dumitru is that what I suspected was the case (from people sometimes asking for someone to cancel or retrigger jobs for them because they didn't think they had an ldap account) is indeed the case: we never made it clear to IT that whenever someone gets access to push and has an ldap account created for them because of that, they do need to be told that it was created and need to be told the password, because using try calls for using self-serve and thus requires using an ldap password. Those careless people who aren't cancelling their try jobs when they've obviously gone bad, and aren't cancelling one job when they push a revised patch? They don't know that they can cancel, and they don't know that they should be filing a bug asking to be told their password.

Nobody; OK to take it and work on it

Assignee

Updated

•

12 years ago

Product: mozilla.org → Release Engineering

:kanban-engops

Updated

•

11 years ago

Whiteboard: [tryserver][capacity][hg][hook] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2219] [tryserver][capacity][hg][hook]

:kanban-engops

Updated

•

11 years ago

Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2219] [tryserver][capacity][hg][hook] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2229] [tryserver][capacity][hg][hook]

:kanban-engops

Updated

•

11 years ago

Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2229] [tryserver][capacity][hg][hook] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2233] [tryserver][capacity][hg][hook]

bhearsum@mozilla.com (:bhearsum)

Comment 17

•

10 years ago

I think we're way past the point of this being viable.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WONTFIX

Nobody; OK to take it and work on it

Assignee

Updated

•

8 years ago

Component: General Automation → General